# Binned Data

## Introduction

Scipp supports *binning* of scattered data.

<div class="alert alert-info">

**Terminology**
    
Scipp distinguishes **histogrammed** data from **binned** data:

- Histogrammed data refers to regular dense arrays of, e.g., floating-point values with an associated bin-edge coordinate.
- Binned data refers to the precursor of histogrammed data, i.e., each bin contains a "list" of contributing events or values.
  Binned data can be converted into a histogram by computing the sum or mean over all events or values in a bin.
    
</div>

Scattered data in the context of binning refers to data values irregularly placed in, e.g., space or time.
Binning lets us:

- Map a table of position-based data to an X-Y-Z grid.
- Map a table of position-based data to an angle such as $\theta$.
- Map event time stamps to time bins.

The key feature here is that *binning does not actually histogram or resample data*.
Data is kept in its original form.
Binning provides a wrapper with a coordinate system better suited to working with the scientific data.
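
To make the distinction concrete, the following is a minimal NumPy-only sketch, not Scipp's implementation. Binning groups events by bin while keeping every event, whereas histogramming reduces each bin to a single number. The event positions, weights, and bin edges are made up for illustration.

```python
import numpy as np

# Hypothetical events: one position and one weight per event
x = np.array([0.15, 0.72, 0.31, 0.55, 0.48, 0.88])
weights = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
edges = np.array([0.1, 0.5, 0.9])

# Binning: assign each event to a bin and group the events by bin.
# No information is lost; each bin holds a "list" of events.
bin_index = np.digitize(x, edges) - 1
binned = [weights[bin_index == i] for i in range(len(edges) - 1)]

# Histogramming: reduce each bin to a single value.
histogrammed = np.array([b.sum() for b in binned])
```

The binned form can always be turned into the histogram, but not the other way around.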

## From scattered to binned data

We outline the underlying concepts based on a simple example.
The scattered raw data is represented as a table with metadata for every data point (event):

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipp as sc

np.random.seed(1) # Fixed for reproducibility

Consider a list of measurements at various "points" in space.
Here we restrict ourselves to the X-Y plane for visualization purposes:

In [None]:
N = 50
values = 10*np.random.rand(N)
data = sc.DataArray(
    data=sc.Variable(dims=['position'], unit=sc.units.counts, values=values, variances=values),
    coords={
        'position':sc.Variable(dims=['position'], values=['site-{}'.format(i) for i in range(N)]),
        'x':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N)),
        'y':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N))})
data

For every point we measured, the auxiliary coordinates `'x'` and `'y'` give the position in the X-Y plane.
These are *not* dimension-coordinates, since our measurements are *not* on a 2-D grid, but rather points with an irregular distribution.
`data` is literally a 1-D table of measurements:

In [None]:
sc.table(data)

We can plot this data:

In [None]:
sc.plot(data)

The `'position'` dimension is not a continuous dimension; it essentially just indexes the rows of our table.
In practice, such a figure and this representation of data in general may therefore not be very useful.

As an alternative view of our data we can create a scatter plot.
We do this explicitly here to demonstrate how the content of `data` is connected to elements of the figure:

In [None]:
fig, ax = plt.subplots()
scatter = ax.scatter(
    x=data.coords['x'].values,
    y=data.coords['y'].values,
    c=data.values)
ax.set_xlabel('x [{}]'.format(data.coords['x'].unit))
ax.set_ylabel('y [{}]'.format(data.coords['y'].unit))
cbar = plt.colorbar(scatter)
cbar.set_label("[{}]".format(data.unit))
fig

This shows the distribution in space, but a scatter plot becomes impractical for real datasets with millions of points.
Furthermore, operating on scattered data is often inconvenient and may require knowledge of the underlying representation.

We can now use `scipp.bin` to provide a more accessible wrapper for our data:

In [None]:
xbins = sc.Variable(dims=['x'], unit=sc.units.m, values=[0.1,0.5,0.9])
ybins = sc.Variable(dims=['y'], unit=sc.units.m, values=[0.1,0.3,0.5,0.7,0.9])
binned = sc.bin(data, edges=[ybins, xbins])
binned

`binned` is a 2-D data array, but it contains (a reordered copy of) the original table of "unaligned" data.
Each element of `binned` is a view of a section of that table:

In [None]:
binned.values[0]
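
Conceptually, this layout can be pictured as a table sorted by bin, plus one begin/end index pair per bin. The sketch below is a simplified 1-D NumPy illustration under that assumption, not a description of Scipp's actual internals. The data and the names `begin`, `end`, and `buffer_values` are chosen here for illustration only.

```python
import numpy as np

x = np.array([0.72, 0.15, 0.55, 0.31, 0.88, 0.48])  # hypothetical event coordinate
values = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])   # hypothetical event data
edges = np.array([0.1, 0.5, 0.9])

# Reorder the table so that events of the same bin are contiguous.
# All events lie within the edges here, so every index is valid.
bin_index = np.digitize(x, edges) - 1
order = np.argsort(bin_index, kind='stable')
buffer_x, buffer_values = x[order], values[order]

# One [begin, end) index range per bin into the reordered buffer
counts = np.bincount(bin_index, minlength=len(edges) - 1)
end = np.cumsum(counts)
begin = end - counts

# A bin is a view (slice) of the buffer, not a copy
first_bin = buffer_values[begin[0]:end[0]]
```

Because each bin is only an index range, the event data is stored once and no per-bin copies are needed.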

The binning procedure based on bin edges for `'x'` and `'y'` does *not* perform the actual histogramming step.
However, since the dimensions of `binned` are defined by the bin-edge coordinates for `'x'` and `'y'`, we will see below that it behaves much like normal dense data for operations such as slicing.

We create another figure to better illustrate the structure of `binned`:

In [None]:
fig, ax = plt.subplots()
buffer = binned.bins.constituents['data']
scatter = ax.scatter(
    x=buffer.coords['x'].values,
    y=buffer.coords['y'].values,
    c=buffer.values)
ax.set_xlabel('x [{}]'.format(binned.coords['x'].unit))
ax.set_ylabel('y [{}]'.format(binned.coords['y'].unit))
ax.set_xticks(binned.coords['x'].values)
ax.set_yticks(binned.coords['y'].values)
ax.grid()
cbar = fig.colorbar(scatter)
cbar.set_label("[{}]".format(data.unit))
fig

This is essentially the same figure as the scatter plot for the original `data`.
The differences are:

- A "grid" (the bin edges) that is stored alongside the data.
- All points outside the limits of the specified bin edges have been dropped.

`binned` can now be histogrammed directly, without the need to specify bin boundaries again.
This is done by calling `sum` on the `bins` property, which sums the data within each bin:

In [None]:
sc.plot(binned.bins.sum())

Here `sum` performs histogramming for all "binned" dimensions, in this case `x` and `y`.
The resulting values in the X-Y bins are the counts accumulated from measurements at all points falling into a given bin.
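
For plain NumPy arrays, the analogous reduction is a weighted 2-D histogram. The sketch below uses made-up points together with the same bin-edge values as above. It is meant only as a cross-check of the idea, not as a description of how Scipp computes the sum.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.random(n)
y = rng.random(n)
counts = 10 * rng.random(n)

xedges = np.array([0.1, 0.5, 0.9])
yedges = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# Sum the counts of all points falling into each (y, x) bin
hist, _, _ = np.histogram2d(y, x, bins=[yedges, xedges], weights=counts)

# Points outside the edges do not contribute to any bin.
# Note: NumPy's last bin includes its right edge, hence `<=` here.
inside = (x >= 0.1) & (x <= 0.9) & (y >= 0.1) & (y <= 0.9)
total_inside = counts[inside].sum()
```

The total of `hist` matches `total_inside`, which reflects that points outside the bin edges were dropped.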

The `plot` function automatically resamples and histograms when binned data is supplied:

In [None]:
# Defining resolution not required, but in this example with very few events it improves plot readability
sc.plot(binned, resolution=10)

## Working with binned data

### Slicing

Binned data can be sliced as usual, e.g., to create plots of subregions:

In [None]:
sc.plot(binned['x', 0].bins.sum())

Just like slicing dense variables, a slice of binned data "drops" all unaligned data falling into areas outside the slice:

In [None]:
s0 = binned['x', 0]
s1 = binned['x', 1]
print(f'total events: {sc.sum(binned.bins.sum().data)}')
print(f'events x=0:   {sc.sum(s0.bins.sum().data)}')
print(f'events x=1:   {sc.sum(s1.bins.sum().data)}')

This can provide an intuitive way of "filtering" lists of data based on some property of the list items.
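
Without a binned wrapper, the same selection would have to be written as an explicit condition on the event coordinates. A rough NumPy equivalent, with made-up data and assuming half-open bins `[left, right)`, could look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(50)
counts = 10 * rng.random(50)
xedges = np.array([0.1, 0.5, 0.9])

# Equivalent of selecting the first and the second x bin
in_x0 = (x >= xedges[0]) & (x < xedges[1])
in_x1 = (x >= xedges[1]) & (x < xedges[2])
in_range = (x >= xedges[0]) & (x < xedges[2])

total = counts[in_range].sum()
sum_x0 = counts[in_x0].sum()
sum_x1 = counts[in_x1].sum()
```

The two slices are disjoint, so `sum_x0 + sum_x1` reproduces `total`.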

### Masking

Masks can be defined for the unaligned data array as well as for the realigned wrapper.
This gives fine-grained and intuitive control, e.g., for masking invalid list entries on the one hand and excluding regions in space on the other hand, without the need to manually determine which list entries fall into the exclusion zone.

We define two masks, one in the X-Y plane and one for positions.
The X-Y mask is added to the `masks` dict of `binned`, whereas the position mask is added to the `masks` dict of its `bins` property:

In [None]:
x_y_mask = sc.array(
    dims=binned.dims,
    values=np.array([[True, False], [True, False], [False, False], [False, False]])
)
binned.masks['exclude'] = x_y_mask
binned.bins.masks['broken_sensor'] = binned.bins.coords['y'] > 0.6 * sc.Unit('m')

As usual, more masks can be added if required, and masks can be removed as long as no reduction operation such as summing or histogramming took place.

We can then plot the result.
The mask of the underlying unaligned data is applied during the histogram step, i.e., masked positions are excluded.
The mask of the binned wrapper is indicated in the plot and carried through the histogram step.
Make sure to compare this figure with the one we obtained earlier, before masking, and note how the values of the unmasked X-Y bins have changed due to masked positions of the underlying unaligned data:

In [None]:
sc.plot(binned, resolution=10)
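
The different roles of the two mask levels can be sketched with plain NumPy arrays. This is an illustration with made-up numbers, not Scipp's implementation: an event-level mask removes events before the sum, while a bin-level mask only flags the resulting bin values.

```python
import numpy as np

# Hypothetical events, already grouped into two bins
bins = [
    {'y': np.array([0.2, 0.65, 0.8]), 'counts': np.array([1.0, 2.0, 3.0])},
    {'y': np.array([0.3, 0.4]), 'counts': np.array([4.0, 5.0])},
]

# Event-level mask: events with y > 0.6 are excluded from the sum
sums = np.array([b['counts'][~(b['y'] > 0.6)].sum() for b in bins])

# Bin-level mask: carried alongside the result instead of changing the values
bin_mask = np.array([True, False])
result = np.ma.masked_array(sums, mask=bin_mask)
```

Here the first bin keeps its reduced value in `sums` but is flagged as masked in `result`, so later reductions over `result` skip it.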

### Arithmetic operations

A number of [arithmetic operations and other operations](computation.ipynb) for binned data arrays are supported.

### Manipulating bin-based and event-based metadata

#### Convert bin-based coordinate into event-based coordinate

Consider binned data as above, but with a coordinate that has no corresponding event-based coordinate.
This could be the case, e.g., with a `time` dimension that corresponds to groups rather than bins:

In [None]:
da = binned.copy()
del da.coords['y']
del da.bins.coords['y']
da = da.rename_dims({'y':'time'})
da.coords['time'] = sc.array(dims=['time'], unit='s', values=np.arange(4))
da

If we intend, e.g., to remove the grouping dimension, we may nevertheless want to preserve the `time` information.
This can be achieved [using `bins_like`](../../generated/functions/scipp.bins_like.html#scipp.bins_like) to broadcast, for each bin, a single value to a "list" of the required size:

In [None]:
da.bins.coords['time'] = sc.bins_like(da, da.coords['time'])
sc.table(da.values[0])
sc.table(da.values[2])
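
In terms of plain arrays, broadcasting a per-bin value to every event of that bin resembles repeating each value by the bin size. A minimal NumPy sketch with hypothetical bin sizes:

```python
import numpy as np

time = np.array([0, 1, 2, 3])       # one value per bin
bin_sizes = np.array([2, 0, 3, 1])  # hypothetical number of events per bin

# One time value per event, in the order of the bins
event_time = np.repeat(time, bin_sizes)
```

An empty bin contributes no entries, so the result always has exactly one value per event.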

### Plotting higher dimensions

On-the-fly histogramming is also supported for plotting binned data with more than 2 dimensions:

In [None]:
N = 5000
values = 10*np.random.rand(N)
data3d = sc.DataArray(
    data=sc.Variable(dims=['position'], unit=sc.units.counts, values=values, variances=values),
    coords={
        'position':sc.Variable(dims=['position'], values=['site-{}'.format(i) for i in range(N)]),
        'x':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N)),
        'y':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N)),
        'z':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N))})
zbins = sc.Variable(dims=['z'], unit=sc.units.m, values=np.linspace(0.1, 0.9, 20))
binned = sc.bin(data3d, edges=[zbins, ybins, xbins])
sc.plot(binned, resolution=10)