Binned Data

Introduction

Scipp supports features for binning scattered data.

Terminology

Scipp distinguishes histogrammed data from binned data:

  • Histogrammed data refers to regular dense arrays of, e.g., floating-point values with an associated bin-edge coordinate.

  • Binned data refers to the precursor of histogrammed data, i.e., each bin contains a “list” of contributing events or values. Binned data can be converted into a histogram by computing the sum or mean over all events or values in a bin.
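
The distinction can be illustrated with a minimal plain-Python sketch that does not use Scipp at all. The event list, the bin edges, and all variable names below are hypothetical and chosen only for illustration:

```python
from bisect import bisect_right

# Hypothetical events: (coordinate, value) pairs, irregularly placed.
events = [(0.1, 2.0), (0.7, 1.0), (0.3, 4.0), (0.9, 3.0), (0.4, 6.0)]
edges = [0.0, 0.5, 1.0]  # two bins: [0.0, 0.5) and [0.5, 1.0)

# Binned data: every bin keeps the "list" of events that fall into it.
binned = [[] for _ in range(len(edges) - 1)]
for coord, value in events:
    index = bisect_right(edges, coord) - 1
    if 0 <= index < len(binned):
        binned[index].append((coord, value))

# Histogrammed data: one dense number per bin, here the sum and the mean.
hist_sum = [sum(v for _, v in contents) for contents in binned]
hist_mean = [
    s / len(contents) if contents else float('nan')
    for s, contents in zip(hist_sum, binned)
]

print(binned)
print(hist_sum)
print(hist_mean)
```

Note that `binned` still contains every individual event, whereas `hist_sum` and `hist_mean` only hold one number per bin.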

Scattered data in the context of binning refers to data values irregularly placed in, e.g., space or time. Binning lets us:

  • Map a table of position-based data to an X-Y-Z grid.

  • Map a table of position-based data to an angle such as \(\theta\).

  • Map event time stamps to time bins.

The key feature here is that binning does not actually histogram or resample data. Data is kept in its original form. Binning provides a wrapper with a coordinate system more adequate for working with the scientific data.

From scattered to binned data

We outline the underlying concepts based on a simple example. The scattered raw data is represented as a table with metadata for every data point (event):

[1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import scipp as sc

np.random.seed(1)  # Fixed for reproducibility

Consider a list of measurements at various “points” in space. Here we restrict ourselves to the X-Y plane for visualization purposes:

[2]:
N = 50
values = 10 * np.random.rand(N)
table = sc.DataArray(
    data=sc.array(dims=['row'], unit=sc.units.counts, values=values, variances=values),
    coords={
        'position': sc.array(
            dims=['row'], values=[f'site-{i}' for i in range(N)]
        ),
        'x': sc.array(dims=['row'], unit='m', values=np.random.rand(N)),
        'y': sc.array(dims=['row'], unit='m', values=np.random.rand(N)),
    },
)
table
[2]:
scipp.DataArray (3.79 KB)
    • row: 50
    • position
      (row)
      string
      site-0, site-1, ..., site-48, site-49
      Values:
      ["site-0", "site-1", ..., "site-48", "site-49"]
    • x
      (row)
      float64
      m
      0.019, 0.679, ..., 0.003, 0.617
      Values:
      array([0.01936696, 0.67883553, 0.21162812, 0.26554666, 0.49157316, 0.05336255, 0.57411761, 0.14672857, 0.58930554, 0.69975836, 0.10233443, 0.41405599, 0.69440016, 0.41417927, 0.04995346, 0.53589641, 0.66379465, 0.51488911, 0.94459476, 0.58655504, 0.90340192, 0.1374747 , 0.13927635, 0.80739129, 0.39767684, 0.1653542 , 0.92750858, 0.34776586, 0.7508121 , 0.72599799, 0.88330609, 0.62367221, 0.75094243, 0.34889834, 0.26992789, 0.89588622, 0.42809119, 0.96484005, 0.6634415 , 0.62169572, 0.11474597, 0.94948926, 0.44991213, 0.57838961, 0.4081368 , 0.23702698, 0.90337952, 0.57367949, 0.00287033, 0.61714491])
    • y
      (row)
      float64
      m
      0.327, 0.527, ..., 0.560, 0.013
      Values:
      array([0.3266449 , 0.5270581 , 0.8859421 , 0.35726976, 0.90853515, 0.62336012, 0.01582124, 0.92943723, 0.69089692, 0.99732285, 0.17234051, 0.13713575, 0.93259546, 0.69681816, 0.06600017, 0.75546305, 0.75387619, 0.92302454, 0.71152476, 0.12427096, 0.01988013, 0.02621099, 0.02830649, 0.24621107, 0.86002795, 0.53883106, 0.55282198, 0.84203089, 0.12417332, 0.27918368, 0.58575927, 0.96959575, 0.56103022, 0.01864729, 0.80063267, 0.23297427, 0.8071052 , 0.38786064, 0.86354185, 0.74712164, 0.55624023, 0.13645523, 0.05991769, 0.12134346, 0.04455188, 0.10749413, 0.22570934, 0.71298898, 0.55971698, 0.01255598])
    • (row)
      float64
      counts
      4.170, 7.203, ..., 2.878, 1.300
      σ = 2.042, 2.684, ..., 1.696, 1.140
      Values:
      array([4.17022005e+00, 7.20324493e+00, 1.14374817e-03, 3.02332573e+00, 1.46755891e+00, 9.23385948e-01, 1.86260211e+00, 3.45560727e+00, 3.96767474e+00, 5.38816734e+00, 4.19194514e+00, 6.85219500e+00, 2.04452250e+00, 8.78117436e+00, 2.73875932e-01, 6.70467510e+00, 4.17304802e+00, 5.58689828e+00, 1.40386939e+00, 1.98101489e+00, 8.00744569e+00, 9.68261576e+00, 3.13424178e+00, 6.92322616e+00, 8.76389152e+00, 8.94606664e+00, 8.50442114e-01, 3.90547832e-01, 1.69830420e+00, 8.78142503e+00, 9.83468338e-01, 4.21107625e+00, 9.57889530e+00, 5.33165285e+00, 6.91877114e+00, 3.15515631e+00, 6.86500928e+00, 8.34625672e+00, 1.82882773e-01, 7.50144315e+00, 9.88861089e+00, 7.48165654e+00, 2.80443992e+00, 7.89279328e+00, 1.03226007e+00, 4.47893526e+00, 9.08595503e+00, 2.93614148e+00, 2.87775339e+00, 1.30028572e+00])

      Variances (σ²):
      array([4.17022005e+00, 7.20324493e+00, 1.14374817e-03, 3.02332573e+00, 1.46755891e+00, 9.23385948e-01, 1.86260211e+00, 3.45560727e+00, 3.96767474e+00, 5.38816734e+00, 4.19194514e+00, 6.85219500e+00, 2.04452250e+00, 8.78117436e+00, 2.73875932e-01, 6.70467510e+00, 4.17304802e+00, 5.58689828e+00, 1.40386939e+00, 1.98101489e+00, 8.00744569e+00, 9.68261576e+00, 3.13424178e+00, 6.92322616e+00, 8.76389152e+00, 8.94606664e+00, 8.50442114e-01, 3.90547832e-01, 1.69830420e+00, 8.78142503e+00, 9.83468338e-01, 4.21107625e+00, 9.57889530e+00, 5.33165285e+00, 6.91877114e+00, 3.15515631e+00, 6.86500928e+00, 8.34625672e+00, 1.82882773e-01, 7.50144315e+00, 9.88861089e+00, 7.48165654e+00, 2.80443992e+00, 7.89279328e+00, 1.03226007e+00, 4.47893526e+00, 9.08595503e+00, 2.93614148e+00, 2.87775339e+00, 1.30028572e+00])

For every measured point, the auxiliary coordinates 'x' and 'y' give the position in the X-Y plane. These are not dimension-coordinates, since our measurements are not on a 2-D grid, but rather points with an irregular distribution. table is literally a 1-D table of measurements:

[3]:
sc.table(table)
[3]:
Coordinates                   Data
position   x [m]    y [m]     [counts]
site-0     0.019    0.327     4.170±2.042
site-1     0.679    0.527     7.203±2.684
site-2     0.212    0.886     0.001±0.034
site-3     0.266    0.357     3.023±1.739
site-4     0.492    0.909     1.468±1.211
site-5     0.053    0.623     0.923±0.961
site-6     0.574    0.016     1.863±1.365
site-7     0.147    0.929     3.456±1.859
site-8     0.589    0.691     3.968±1.992
site-9     0.700    0.997     5.388±2.321
...        ...      ...       ...
site-40    0.115    0.556     9.889±3.145
site-41    0.949    0.136     7.482±2.735
site-42    0.450    0.060     2.804±1.675
site-43    0.578    0.121     7.893±2.809
site-44    0.408    0.045     1.032±1.016
site-45    0.237    0.107     4.479±2.116
site-46    0.903    0.226     9.086±3.014
site-47    0.574    0.713     2.936±1.714
site-48    0.003    0.560     2.878±1.696
site-49    0.617    0.013     1.300±1.140

We can plot this data:

[4]:
table.plot()
[4]:
../../_images/user-guide_binned-data_binned-data_7_0.svg

The 'row' dimension is not a continuous dimension but essentially just a row in our table. In practice, such a figure and this representation of data in general may therefore not be very useful.

As an alternative view of our data we can create a scatter plot. We do this explicitly here to demonstrate how the content of table is connected to elements of the figure:

[5]:
fig, ax = plt.subplots(dpi=96)
scatter = ax.scatter(
    x=table.coords['x'].values, y=table.coords['y'].values, c=table.values
)
ax.set_xlabel('x [{}]'.format(table.coords['x'].unit))
ax.set_ylabel('y [{}]'.format(table.coords['y'].unit))
cbar = plt.colorbar(scatter)
cbar.set_label(f"[{table.unit}]")
../../_images/user-guide_binned-data_binned-data_9_0.png

This shows the distribution in space, but for real datasets with millions of points this is not practical. Furthermore, operating on scattered data is often inconvenient and may require knowledge of the underlying representation.

We can now use scipp.bin to provide a more accessible wrapper for our data:

[6]:
binned = table.bin(y=4, x=2)
binned
[6]:
scipp.DataArray (5.34 KB)
    • y: 4
    • x: 2
    • x
      (x [bin-edge])
      float64
      m
      0.003, 0.484, 0.965
      Values:
      array([0.00287033, 0.48385519, 0.96484005])
    • y
      (y [bin-edge])
      float64
      m
      0.013, 0.259, 0.505, 0.751, 0.997
      Values:
      array([0.01255598, 0.2587477 , 0.50493942, 0.75113113, 0.99732285])
    • (y, x)
      DataArrayView
      binned data [len=9, len=10, ..., len=6, len=8]
      dim='row',
      content=DataArray(
                dims=(row: 50),
                data=float64[counts],
                coords={'position':string, 'x':float64[m], 'y':float64[m]})

binned is a 2-D data array, but it contains (a reordered copy of) the original table of “unaligned” data. Each element of binned is a view of a section of that table:
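
The idea of a reordered copy plus per-bin views can be sketched in plain Python. This is only a conceptual illustration under assumed names (rows, buffer, begin, end, bin_view); it is not meant to describe Scipp's actual internal layout:

```python
# Hypothetical events as (x, value) rows of a table, in arbitrary order.
rows = [(0.8, 1.0), (0.2, 2.0), (0.6, 3.0), (0.1, 4.0), (0.9, 5.0)]
edges = [0.0, 0.5, 1.0]


def bin_index(x):
    """Return the index of the bin [edges[i], edges[i + 1]) containing x, or None."""
    for i in range(len(edges) - 1):
        if edges[i] <= x < edges[i + 1]:
            return i
    return None


# Reordered copy of the table: rows sorted by bin (the sort is stable, so the
# original order is kept within each bin). Rows outside all bins are dropped.
kept = [row for row in rows if bin_index(row[0]) is not None]
buffer = sorted(kept, key=lambda row: bin_index(row[0]))

# Each bin is described by a [begin, end) index range into the buffer.
begin, end = [], []
for i in range(len(edges) - 1):
    indices = [k for k, row in enumerate(buffer) if bin_index(row[0]) == i]
    begin.append(indices[0] if indices else 0)
    end.append(indices[-1] + 1 if indices else 0)


def bin_view(i):
    """Rows belonging to bin i, i.e. a section of the reordered table."""
    return buffer[begin[i]:end[i]]


print(bin_view(0))
print(bin_view(1))
```

In this sketch no row is modified or merged; binning only changes the order of the rows and records where each bin starts and ends.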

[7]:
binned.values[0]
[7]:
scipp.DataArray (1.92 KB out of 3.79 KB)
    • row: 9
    • position
      (row)
      string
      site-10, site-11, ..., site-44, site-45
      Values:
      ["site-10", "site-11", ..., "site-44", "site-45"]
    • x
      (row)
      float64
      m
      0.102, 0.414, ..., 0.408, 0.237
      Values:
      array([0.10233443, 0.41405599, 0.04995346, 0.1374747 , 0.13927635, 0.34889834, 0.44991213, 0.4081368 , 0.23702698])
    • y
      (row)
      float64
      m
      0.172, 0.137, ..., 0.045, 0.107
      Values:
      array([0.17234051, 0.13713575, 0.06600017, 0.02621099, 0.02830649, 0.01864729, 0.05991769, 0.04455188, 0.10749413])
    • (row)
      float64
      counts
      4.192, 6.852, ..., 1.032, 4.479
      σ = 2.047, 2.618, ..., 1.016, 2.116
      Values:
      array([4.19194514, 6.852195 , 0.27387593, 9.68261576, 3.13424178, 5.33165285, 2.80443992, 1.03226007, 4.47893526])

      Variances (σ²):
      array([4.19194514, 6.852195 , 0.27387593, 9.68261576, 3.13424178, 5.33165285, 2.80443992, 1.03226007, 4.47893526])

Binning based on bin edges for 'x' and 'y' does not perform the actual histogramming step. However, since the dimensions of binned are defined by the bin-edge coordinates for 'x' and 'y', we will see below that it behaves much like normal dense data for operations such as slicing.

We create another figure to better illustrate the structure of binned:

[8]:
fig, ax = plt.subplots(dpi=96)
buffer = binned.bins.constituents['data']
scatter = ax.scatter(
    x=buffer.coords['x'].values, y=buffer.coords['y'].values, c=buffer.values
)
ax.set_xlabel('x [{}]'.format(binned.coords['x'].unit))
ax.set_ylabel('y [{}]'.format(binned.coords['y'].unit))
ax.set_xticks(binned.coords['x'].values)
ax.set_yticks(binned.coords['y'].values)
ax.grid()
cbar = fig.colorbar(scatter)
cbar.set_label(f"[{table.unit}]")
../../_images/user-guide_binned-data_binned-data_15_0.png

This is essentially the same figure as the scatter plot for the original data. The differences are:

  • A “grid” (the bin edges) that is stored alongside the data.

  • All points outside the limits of the specified bin edges have been dropped.

binned can now directly be histogrammed, without the need for specifying bin boundaries:

[9]:
binned.hist().plot()
[9]:
../../_images/user-guide_binned-data_binned-data_17_0.svg

Note that calling binned.hist() is equivalent to binned.bins.sum(), summing the data array within each bin. Here sum performs histogramming for all “binned” dimensions, in this case x and y. The resulting values in the X-Y bins are the counts accumulated from measurements at all points falling in a given bin.
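
As a rough plain-Python analogy (not Scipp code), summing the event values within each X-Y bin yields the dense 2-D histogram. The events, edges, and the helper find_bin below are hypothetical:

```python
# Hypothetical events: (x, y, value).
events = [(0.1, 0.2, 1.0), (0.6, 0.1, 2.0), (0.7, 0.9, 3.0), (0.2, 0.3, 4.0)]
x_edges = [0.0, 0.5, 1.0]
y_edges = [0.0, 0.5, 1.0]


def find_bin(edges, coord):
    """Return i such that edges[i] <= coord < edges[i + 1], or None if outside."""
    for i in range(len(edges) - 1):
        if edges[i] <= coord < edges[i + 1]:
            return i
    return None


# binned[iy][ix] is the list of event values in that X-Y bin.
binned = [[[] for _ in x_edges[1:]] for _ in y_edges[1:]]
for x, y, value in events:
    ix, iy = find_bin(x_edges, x), find_bin(y_edges, y)
    if ix is not None and iy is not None:
        binned[iy][ix].append(value)

# Summing within each bin gives one number per bin: the 2-D histogram.
hist = [[sum(cell) for cell in row] for row in binned]
print(hist)
```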

We can also histogram the binned data explicitly with a different bin count to obtain a higher-resolution plot:

[10]:
binned.hist(y=10, x=10).plot()
[10]:
../../_images/user-guide_binned-data_binned-data_19_0.svg
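
Histogramming at a finer resolution is possible because the binned data still holds the individual events. The following plain-Python sketch, with a hypothetical histogram helper and made-up events, histograms the same events with coarse and fine edges:

```python
def histogram(events, edges):
    """Sum event weights into the bins [edges[i], edges[i + 1])."""
    counts = [0.0] * (len(edges) - 1)
    for x, weight in events:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += weight
                break
    return counts


# Hypothetical events (x, weight). Since the events are kept, any set of
# edges can be applied later.
events = [(0.1, 1.0), (0.3, 2.0), (0.6, 3.0), (0.9, 4.0)]

coarse = histogram(events, [0.0, 0.5, 1.0])
fine = histogram(events, [0.0, 0.25, 0.5, 0.75, 1.0])

print(coarse)
print(fine)
```

Starting from `coarse` alone, the values in `fine` could not be recovered, which is why keeping the events around is useful.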

Note

Unless Plopp is used for plotting (as in the Scipp documentation, but not yet enabled by default), plot() automatically histograms and resamples data. Automatic resampling by plot() will be removed in the near future because it produces inconsistent results. Explicit binning will then be required, so we recommend histogramming explicitly already now to ease the transition.

Working with binned data

Slicing

Binned data can be sliced as usual, e.g., to create plots of subregions:

[11]:
binned['x', 0].hist(y=10).plot()
[11]:
../../_images/user-guide_binned-data_binned-data_22_0.svg

Just like slicing dense variables, a slice of binned data “drops” all scattered data falling into areas outside the slice:

[12]:
s0 = binned['x', 0]
s1 = binned['x', 1]
print(f'total events: {binned.sum().value}')
print(f'events x=0:   {s0.sum().value}')
print(f'events x=1:   {s1.sum().value}')
total events: 233.48779981759884
events x=0:   102.78766950179809
events x=1:   130.7001303158007

This can provide an intuitive way of “filtering” lists of data based on some property of the list items.
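
Conceptually, slicing out one bin corresponds to filtering the event list by the bin's coordinate range. A minimal plain-Python sketch, with hypothetical events and a hypothetical helper slice_bin:

```python
# Hypothetical events (x, value) and bin edges along x.
events = [(0.1, 4.0), (0.7, 7.0), (0.2, 1.0), (0.9, 3.0)]
x_edges = [0.0, 0.5, 1.0]


def slice_bin(events, edges, index):
    """Keep only the events whose x lies in [edges[index], edges[index + 1])."""
    low, high = edges[index], edges[index + 1]
    return [(x, value) for x, value in events if low <= x < high]


s0 = slice_bin(events, x_edges, 0)
s1 = slice_bin(events, x_edges, 1)

total = sum(value for _, value in events)
sum0 = sum(value for _, value in s0)
sum1 = sum(value for _, value in s1)
print(total, sum0, sum1)
```

As in the Scipp example above, the sums of the two slices add up to the total, since every event falls into exactly one of the two bins.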

Masking

Masks can be defined for the scattered data array, as well as for the bin wrapper. This gives fine-grained and intuitive control, e.g., for masking invalid list entries on the one hand, and excluding regions in space on the other hand, without the need to manually determine which list entries fall into the exclusion zone.

We define two masks: one in the X-Y plane and one for positions. The position mask is added to the masks dict of the bins property:

[13]:
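# Rebin to a finer 10x10 grid before defining the masks
binned = binned.bin(y=10, x=10)
# Bin-level mask: compare the upper bin edges (edges[1:]) against 0.5 m
x_y_mask = (binned.coords['x'][1:] < 0.5 * sc.Unit('m')) & (
    binned.coords['y'][1:] < 0.5 * sc.Unit('m')
)
binned.masks['exclude'] = x_y_mask
# Event-level mask: flags individual events with y > 0.6 m inside the bins
binned.bins.masks['broken_sensor'] = binned.bins.coords['y'] > 0.6 * sc.Unit('m')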

As usual, more masks can be added if required, and masks can be removed as long as no reduction operation such as summing or histogramming took place.

We can then plot the result. The mask of the underlying scattered data is applied during the histogram step, i.e., masked positions are excluded. The mask of the binned wrapper is indicated in the plot and carried through the histogram step, provided that the binning is not changed. Compare this figure with the one obtained earlier, before masking, and note how the values of the unmasked X-Y bins have changed because positions in the underlying scattered data are masked:

[14]:
binned.hist().plot()
[14]:
../../_images/user-guide_binned-data_binned-data_29_0.svg
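
The two mask levels can be illustrated with a plain-Python sketch. The data, the threshold, and the names bin_mask and event_masked are hypothetical; the sketch only mirrors the behavior described above and is not Scipp code:

```python
# Hypothetical binned data: two bins along x, each holding (y, value) events.
bins = [
    [(0.2, 1.0), (0.8, 2.0)],  # bin 0
    [(0.3, 3.0), (0.7, 4.0)],  # bin 1
]
bin_mask = [True, False]  # bin-level mask: bin 0 is flagged as excluded


def event_masked(y):
    """Event-level mask, analogous to the 'broken_sensor' mask above."""
    return y > 0.6


# Histogramming skips masked events. The bin-level mask is not applied here,
# it is only carried along with the result.
hist = [sum(v for y, v in contents if not event_masked(y)) for contents in bins]
hist_mask = list(bin_mask)

# A later reduction over bins can then leave out the masked bins.
total_unmasked = sum(h for h, m in zip(hist, hist_mask) if not m)

print(hist, hist_mask, total_unmasked)
```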

Arithmetic operations

A number of arithmetic operations and other operations for binned data arrays are supported.

Manipulating bin-based and event-based metadata

Convert bin-based coordinate into event-based coordinate

Consider binned data as above, but with a coordinate that has no corresponding event-based coordinate. This could be the case, e.g., with a time dimension that corresponds to groups rather than bins:

[15]:
da = table.bin(y=4, x=2)
del da.coords['y']
del da.bins.coords['y']
da = da.rename_dims({'y': 'time'})
da.coords['time'] = sc.arange('time', 4, unit='s')
da
[15]:
scipp.DataArray (4.70 KB)
    • time: 4
    • x: 2
    • time
      (time)
      int64
      s
      0, 1, 2, 3
      Values:
      array([0, 1, 2, 3])
    • x
      (x [bin-edge])
      float64
      m
      0.003, 0.484, 0.965
      Values:
      array([0.00287033, 0.48385519, 0.96484005])
    • (time, x)
      DataArrayView
      binned data [len=9, len=10, ..., len=6, len=8]
      dim='row',
      content=DataArray(
                dims=(row: 50),
                data=float64[counts],
                coords={'position':string, 'x':float64[m]})

If we, e.g., intend to erase the grouping dimension, we may nevertheless want to preserve the time information. This can be achieved using bins_like to broadcast, for each bin, a single value to a “list” of the required size:

[16]:
da.bins.coords['time'] = sc.bins_like(da, da.coords['time'])
sc.table(da.values[0])
[16]:
Coordinates                     Data
position   time [s]   x [m]     [counts]
site-10    0          0.102     4.192±2.047
site-11    0          0.414     6.852±2.618
site-14    0          0.050     0.274±0.523
site-21    0          0.137     9.683±3.112
site-22    0          0.139     3.134±1.770
site-33    0          0.349     5.332±2.309
site-42    0          0.450     2.804±1.675
site-44    0          0.408     1.032±1.016
site-45    0          0.237     4.479±2.116
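
The broadcast of one value per bin to one value per event can be sketched in plain Python. The bin contents and names below are hypothetical, and the sketch is only an analogy for what bins_like does conceptually:

```python
# Hypothetical bin contents (event values only) and one coordinate value per bin.
bin_contents = [[4.2, 6.9, 0.3], [9.7], []]
time_per_bin = [0, 1, 2]

# Broadcast each bin's single value to a "list" matching that bin's event count.
time_per_event = [
    [t] * len(contents) for t, contents in zip(time_per_bin, bin_contents)
]

print(time_per_event)
```

Every event now carries the time value of the bin it belongs to, so the information survives if the bins are later merged.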

Note the different values for the time coordinate above and below:

[17]:
sc.table(da.values[2])
[17]:
Coordinates                     Data
position   time [s]   x [m]     [counts]
site-0     1          0.019     4.170±2.042
site-3     1          0.266     3.023±1.739