Binned Data

Introduction

Scipp supports features for binning scattered data.

Terminology

Scipp distinguishes histogrammed data from binned data:

  • Histogrammed data refers to regular dense arrays of, e.g., floating-point values with an associated bin-edge coordinate.

  • Binned data refers to the precursor of histogrammed data, i.e., each bin contains a “list” of contributing events or values. Binned data can be converted into a histogram by computing the sum or mean over all events or values in a bin.
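The distinction can be sketched with plain NumPy. This is only an illustration of the concept, not how Scipp stores binned data; the event table and the names `event_x`, `event_w`, and `edges` are made up for this example.

```python
import numpy as np

# Hypothetical event table: one coordinate and one weight per event.
event_x = np.array([0.15, 0.62, 0.33, 0.71, 0.48, 0.95])
event_w = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
edges = np.array([0.0, 0.5, 1.0])  # two bins: [0, 0.5) and [0.5, 1.0)

# "Binned": every bin keeps the list of events that fall into it.
bin_index = np.digitize(event_x, edges) - 1
binned_lists = [event_w[bin_index == i] for i in range(len(edges) - 1)]

# "Histogrammed": one dense value per bin, here the sum over each list.
histogram = np.array([events.sum() for events in binned_lists])

print(binned_lists)
print(histogram)
```

The list of per-bin arrays still holds every individual event, whereas the histogram retains only one number per bin.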

Scattered data in the context of binning refers to data values irregularly placed in, e.g., space or time. Binning lets us:

  • Map a table of position-based data to an X-Y-Z grid.

  • Map a table of position-based data to an angle such as \(\theta\).

  • Map event time stamps to time bins.

The key feature here is that binning does not actually histogram or resample data. Data is kept in its original form. Binning provides a wrapper with a coordinate system better suited to working with the scientific data.
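As a rough illustration of this point, the following NumPy sketch bins a small, made-up event table by a derived angle. The values, the angle edges, and all variable names are assumptions for this example; Scipp's own implementation is not shown here.

```python
import numpy as np

# Hypothetical event table: positions and one measured value per event.
x = np.array([1.0, -1.0, 2.0, -1.0])
y = np.array([1.0, 1.0, 1.0, 2.0])
values = np.array([10.0, 20.0, 30.0, 40.0])
original = values.copy()

# Derived coordinate: polar angle in degrees (about 45, 135, 26.6, 116.6).
theta = np.degrees(np.arctan2(y, x))
theta_edges = np.array([0.0, 90.0, 180.0])

# Binning only assigns a bin index to each event ...
bin_index = np.digitize(theta, theta_edges) - 1

# ... and groups the events accordingly; the values are not modified.
events_per_bin = {i: values[bin_index == i] for i in range(len(theta_edges) - 1)}
print(events_per_bin)
```

No value is summed, averaged, or interpolated in this step; the grouping merely records which events belong to which angle bin.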

From scattered to binned data

We outline the underlying concepts based on a simple example. The scattered raw data is represented as a table with metadata for every data point (event):

[1]:
import numpy as np
import matplotlib.pyplot as plt
import scipp as sc

np.random.seed(1) # Fixed for reproducibility

Consider a list of measurements at various “points” in space. Here we restrict ourselves to the X-Y plane for visualization purposes:

[2]:
N = 50
values = 10*np.random.rand(N)
data = sc.DataArray(
    data=sc.Variable(dims=['position'], unit=sc.units.counts, values=values, variances=values),
    coords={
        'position':sc.Variable(dims=['position'], values=['site-{}'.format(i) for i in range(N)]),
        'x':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N)),
        'y':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N))})
data
[2]:
scipp.DataArray (3.12 KB)
    • position: 50
    • position
      (position)
      string
      site-0, site-1, ..., site-48, site-49
      Values:
      ["site-0", "site-1", "site-2", "site-3", ..., "site-46", "site-47", "site-48", "site-49"]
    • x
      (position)
      float64
      m
      0.02, 0.68, ..., 0.0, 0.62
      Values:
      array([0.01936696, 0.67883553, 0.21162812, 0.26554666, 0.49157316, 0.05336255, 0.57411761, 0.14672857, 0.58930554, 0.69975836, 0.10233443, 0.41405599, 0.69440016, 0.41417927, 0.04995346, 0.53589641, 0.66379465, 0.51488911, 0.94459476, 0.58655504, 0.90340192, 0.1374747 , 0.13927635, 0.80739129, 0.39767684, 0.1653542 , 0.92750858, 0.34776586, 0.7508121 , 0.72599799, 0.88330609, 0.62367221, 0.75094243, 0.34889834, 0.26992789, 0.89588622, 0.42809119, 0.96484005, 0.6634415 , 0.62169572, 0.11474597, 0.94948926, 0.44991213, 0.57838961, 0.4081368 , 0.23702698, 0.90337952, 0.57367949, 0.00287033, 0.61714491])
    • y
      (position)
      float64
      m
      0.33, 0.53, ..., 0.56, 0.01
      Values:
      array([0.3266449 , 0.5270581 , 0.8859421 , 0.35726976, 0.90853515, 0.62336012, 0.01582124, 0.92943723, 0.69089692, 0.99732285, 0.17234051, 0.13713575, 0.93259546, 0.69681816, 0.06600017, 0.75546305, 0.75387619, 0.92302454, 0.71152476, 0.12427096, 0.01988013, 0.02621099, 0.02830649, 0.24621107, 0.86002795, 0.53883106, 0.55282198, 0.84203089, 0.12417332, 0.27918368, 0.58575927, 0.96959575, 0.56103022, 0.01864729, 0.80063267, 0.23297427, 0.8071052 , 0.38786064, 0.86354185, 0.74712164, 0.55624023, 0.13645523, 0.05991769, 0.12134346, 0.04455188, 0.10749413, 0.22570934, 0.71298898, 0.55971698, 0.01255598])
    • (position)
      float64
      counts
      4.17, 7.2, ..., 2.88, 1.3
      σ = 2.04, 2.68, ..., 1.7, 1.14
      Values:
      array([4.17022005e+00, 7.20324493e+00, 1.14374817e-03, 3.02332573e+00, 1.46755891e+00, 9.23385948e-01, 1.86260211e+00, 3.45560727e+00, 3.96767474e+00, 5.38816734e+00, 4.19194514e+00, 6.85219500e+00, 2.04452250e+00, 8.78117436e+00, 2.73875932e-01, 6.70467510e+00, 4.17304802e+00, 5.58689828e+00, 1.40386939e+00, 1.98101489e+00, 8.00744569e+00, 9.68261576e+00, 3.13424178e+00, 6.92322616e+00, 8.76389152e+00, 8.94606664e+00, 8.50442114e-01, 3.90547832e-01, 1.69830420e+00, 8.78142503e+00, 9.83468338e-01, 4.21107625e+00, 9.57889530e+00, 5.33165285e+00, 6.91877114e+00, 3.15515631e+00, 6.86500928e+00, 8.34625672e+00, 1.82882773e-01, 7.50144315e+00, 9.88861089e+00, 7.48165654e+00, 2.80443992e+00, 7.89279328e+00, 1.03226007e+00, 4.47893526e+00, 9.08595503e+00, 2.93614148e+00, 2.87775339e+00, 1.30028572e+00])

      Variances (σ²):
      array([4.17022005e+00, 7.20324493e+00, 1.14374817e-03, 3.02332573e+00, 1.46755891e+00, 9.23385948e-01, 1.86260211e+00, 3.45560727e+00, 3.96767474e+00, 5.38816734e+00, 4.19194514e+00, 6.85219500e+00, 2.04452250e+00, 8.78117436e+00, 2.73875932e-01, 6.70467510e+00, 4.17304802e+00, 5.58689828e+00, 1.40386939e+00, 1.98101489e+00, 8.00744569e+00, 9.68261576e+00, 3.13424178e+00, 6.92322616e+00, 8.76389152e+00, 8.94606664e+00, 8.50442114e-01, 3.90547832e-01, 1.69830420e+00, 8.78142503e+00, 9.83468338e-01, 4.21107625e+00, 9.57889530e+00, 5.33165285e+00, 6.91877114e+00, 3.15515631e+00, 6.86500928e+00, 8.34625672e+00, 1.82882773e-01, 7.50144315e+00, 9.88861089e+00, 7.48165654e+00, 2.80443992e+00, 7.89279328e+00, 1.03226007e+00, 4.47893526e+00, 9.08595503e+00, 2.93614148e+00, 2.87775339e+00, 1.30028572e+00])

For every measured point, the auxiliary coordinates 'x' and 'y' give the position in the X-Y plane. These are not dimension-coordinates, since our measurements are not on a 2-D grid, but rather points with an irregular distribution. data is literally a 1-D table of measurements:

[3]:
sc.table(data)

We can plot this data:

[4]:
sc.plot(data)

The 'position' dimension is not a continuous dimension but essentially just a row in our table. In practice, such a figure and this representation of data in general may therefore not be very useful.

As an alternative view of our data we can create a scatter plot. We do this explicitly here to demonstrate how the content of data is connected to elements of the figure:

[5]:
fig, ax = plt.subplots()
scatter = ax.scatter(
    x=data.coords['x'].values,
    y=data.coords['y'].values,
    c=data.values)
ax.set_xlabel('x [{}]'.format(data.coords['x'].unit))
ax.set_ylabel('y [{}]'.format(data.coords['y'].unit))
cbar = plt.colorbar(scatter)
cbar.set_label("[{}]".format(data.unit))
fig
[5]:
../../_images/user-guide_binned-data_binned-data_9_0.png

This shows the distribution in space, but for real datasets with millions of points this may not be convenient. Furthermore, operating on scattered data is often inconvenient and may require knowledge of the underlying representation.

We can now use scipp.bin to provide a more accessible wrapper for our data:

[6]:
xbins = sc.Variable(dims=['x'], unit=sc.units.m, values=[0.1,0.5,0.9])
ybins = sc.Variable(dims=['y'], unit=sc.units.m, values=[0.1,0.3,0.5,0.7,0.9])
binned = sc.bin(data, edges=[ybins, xbins])
binned
[6]:
scipp.DataArray (1.88 KB)
    • y: 4
    • x: 2
    • x
      (x [bin-edge])
      float64
      m
      0.1, 0.5, 0.9
      Values:
      array([0.1, 0.5, 0.9])
    • y
      (y [bin-edge])
      float64
      m
      0.1, 0.3, 0.5, 0.7, 0.9
      Values:
      array([0.1, 0.3, 0.5, 0.7, 0.9])
    • (y, x)
      DataArrayView
      binned data [len=3, len=6, ..., len=5, len=5]
      Values:
      [<scipp.DataArray> Dimensions: Sizes[position:3, ] Coordinates: position string [dimensionless] (position) ["site-10", "site-11", "site-45"] x float64 [m] (position) [0.102334, 0.414056, 0.237027] y float64 [m] (position) [0.172341, 0.137136, 0.107494] Data: float64 [counts] (position) [4.191945, 6.852195, 4.478935] [4.191945, 6.852195, 4.478935] , <scipp.DataArray> Dimensions: Sizes[position:6, ] Coordinates: position string [dimensionless] (position) ["site-19", "site-23", "site-28", "site-29", "site-35", "site-43"] x float64 [m] (position) [0.586555, 0.807391, 0.750812, 0.725998, 0.895886, 0.578390] y float64 [m] (position) [0.124271, 0.246211, 0.124173, 0.279184, 0.232974, 0.121343] Data: float64 [counts] (position) [1.981015, 6.923226, 1.698304, 8.781425, 3.155156, 7.892793] [1.981015, 6.923226, 1.698304, 8.781425, 3.155156, 7.892793] , <scipp.DataArray> Dimensions: Sizes[position:1, ] Coordinates: position string [dimensionless] (position) ["site-3"] x float64 [m] (position) [0.265547] y float64 [m] (position) [0.357270] Data: float64 [counts] (position) [3.023326] [3.023326] , <scipp.DataArray> Dimensions: Sizes[position:0, ] Coordinates: position string [dimensionless] (position) [] x float64 [m] (position) [] y float64 [m] (position) [] Data: float64 [counts] (position) [] [] , <scipp.DataArray> Dimensions: Sizes[position:3, ] Coordinates: position string [dimensionless] (position) ["site-13", "site-25", "site-40"] x float64 [m] (position) [0.414179, 0.165354, 0.114746] y float64 [m] (position) [0.696818, 0.538831, 0.556240] Data: float64 [counts] (position) [8.781174, 8.946067, 9.888611] [8.781174, 8.946067, 9.888611] , <scipp.DataArray> Dimensions: Sizes[position:4, ] Coordinates: position string [dimensionless] (position) ["site-1", "site-8", "site-30", "site-32"] x float64 [m] (position) [0.678836, 0.589306, 0.883306, 0.750942] y float64 [m] (position) [0.527058, 0.690897, 0.585759, 0.561030] Data: float64 [counts] (position) [7.203245, 3.967675, 
0.983468, 9.578895] [7.203245, 3.967675, 0.983468, 9.578895] , <scipp.DataArray> Dimensions: Sizes[position:5, ] Coordinates: position string [dimensionless] (position) ["site-2", "site-24", "site-27", "site-34", "site-36"] x float64 [m] (position) [0.211628, 0.397677, 0.347766, 0.269928, 0.428091] y float64 [m] (position) [0.885942, 0.860028, 0.842031, 0.800633, 0.807105] Data: float64 [counts] (position) [0.001144, 8.763892, 0.390548, 6.918771, 6.865009] [0.001144, 8.763892, 0.390548, 6.918771, 6.865009] , <scipp.DataArray> Dimensions: Sizes[position:5, ] Coordinates: position string [dimensionless] (position) ["site-15", "site-16", "site-38", "site-39", "site-47"] x float64 [m] (position) [0.535896, 0.663795, 0.663441, 0.621696, 0.573679] y float64 [m] (position) [0.755463, 0.753876, 0.863542, 0.747122, 0.712989] Data: float64 [counts] (position) [6.704675, 4.173048, 0.182883, 7.501443, 2.936141] [6.704675, 4.173048, 0.182883, 7.501443, 2.936141] ]

binned is a 2-D data array, but it contains (a reordered copy of) the original table of “unaligned” data. Each element of binned is a view of a section of that table:

[7]:
binned.values[0]
[7]:
scipp.DataArray (192 Bytes out of 1.69 KB)
    • position: 3
    • position
      (position)
      string
      site-10, site-11, site-45
      Values:
      ["site-10", "site-11", "site-45"]
    • x
      (position)
      float64
      m
      0.1, 0.41, 0.24
      Values:
      array([0.10233443, 0.41405599, 0.23702698])
    • y
      (position)
      float64
      m
      0.17, 0.14, 0.11
      Values:
      array([0.17234051, 0.13713575, 0.10749413])
    • (position)
      float64
      counts
      4.19, 6.85, 4.48
      σ = 2.05, 2.62, 2.12
      Values:
      array([4.19194514, 6.852195 , 4.47893526])

      Variances (σ²):
      array([4.19194514, 6.852195 , 4.47893526])
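The idea of a reordered table with per-bin views can be sketched in NumPy. This is a simplified 1-D model under our own assumptions (all events lie inside the edges; the names `buffer_w`, `begin`, and `end` are chosen for this example) and is not meant to mirror Scipp's internal layout exactly.

```python
import numpy as np

# Hypothetical 1-D event table; all events lie within the bin edges.
event_x = np.array([0.8, 0.2, 0.6, 0.3, 0.7])
event_w = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
edges = np.array([0.1, 0.5, 0.9])
nbin = len(edges) - 1

# Sort events by bin so that every bin occupies a contiguous range.
bin_index = np.digitize(event_x, edges) - 1
order = np.argsort(bin_index, kind='stable')
buffer_w = event_w[order]  # reordered copy of the table column

# Begin/end offsets of each bin within the reordered buffer.
counts = np.bincount(bin_index, minlength=nbin)
begin = np.concatenate(([0], np.cumsum(counts)[:-1]))
end = begin + counts

# Basic slicing returns a view into the buffer, not a copy.
first_bin = buffer_w[begin[0]:end[0]]
print(first_bin, np.shares_memory(first_bin, buffer_w))
```

Storing only offsets per bin keeps the wrapper small: the event data lives once in the buffer, and each bin refers to a range of it.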

The binning procedure based on bin edges for 'x' and 'y' does not perform the actual histogramming step. However, since the dimensions of binned are defined by the bin-edge coordinates for 'x' and 'y', we will see below that it behaves much like normal dense data for operations such as slicing.

We create another figure to better illustrate the structure of binned:

[8]:
fig, ax = plt.subplots()
buffer = binned.bins.constituents['data']
scatter = ax.scatter(
    x=buffer.coords['x'].values,
    y=buffer.coords['y'].values,
    c=buffer.values)
ax.set_xlabel('x [{}]'.format(binned.coords['x'].unit))
ax.set_ylabel('y [{}]'.format(binned.coords['y'].unit))
ax.set_xticks(binned.coords['x'].values)
ax.set_yticks(binned.coords['y'].values)
ax.grid()
cbar = fig.colorbar(scatter)
cbar.set_label("[{}]".format(data.unit))
fig
[8]:
../../_images/user-guide_binned-data_binned-data_15_0.png

This is essentially the same figure as the scatter plot for the original data. The differences are:

  • A “grid” (the bin edges) that is stored alongside the data.

  • All points outside the limits of the specified bin edges have been dropped.
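To get a feeling for how many points are dropped, the selection can be mimicked with NumPy. The random points below are not the data from this guide, and treating the bins as half-open intervals (lower edge included, upper edge excluded) is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(50)
y = rng.random(50)
x_edges = np.array([0.1, 0.5, 0.9])
y_edges = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# A point is kept only if it lies within the outermost edges in both x and y.
inside = (
    (x >= x_edges[0]) & (x < x_edges[-1])
    & (y >= y_edges[0]) & (y < y_edges[-1])
)
n_kept = int(inside.sum())
n_dropped = int((~inside).sum())
print(f'kept {n_kept}, dropped {n_dropped}')
```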

binned can now directly be histogrammed, without the need for specifying bin boundaries. This is done by calling sum on the bins property, which sums the data array within each bin:

[9]:
sc.plot(binned.bins.sum())

Here sum performs histogramming for all “binned” dimensions, in this case x and y. The resulting values in the X-Y bins are the counts accumulated from measurements at all points falling in a given bin.
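Conceptually, summing within each bin gives the same numbers as a weighted 2-D histogram of the scattered points. The following NumPy sketch checks this for one bin using made-up random data; it illustrates the arithmetic only and does not use Scipp.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(50)
y = rng.random(50)
w = 10 * rng.random(50)  # hypothetical counts per point
x_edges = np.array([0.1, 0.5, 0.9])
y_edges = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# Weighted histogram with dimensions (y, x), matching the order used above.
hist, _, _ = np.histogram2d(y, x, bins=[y_edges, x_edges], weights=w)

# Cross-check the first bin by summing the contributing points by hand.
in_bin = (
    (y >= y_edges[0]) & (y < y_edges[1])
    & (x >= x_edges[0]) & (x < x_edges[1])
)
print(hist.shape, hist[0, 0], w[in_bin].sum())
```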

The plot function automatically resamples and histograms when binned data is supplied:

[10]:
# Defining resolution not required, but in this example with very few events it improves plot readability
sc.plot(binned, resolution=10)

Working with binned data

Slicing

Binned data can be sliced as usual, e.g., to create plots of subregions:

[11]:
sc.plot(binned['x', 0].bins.sum())

Just like slicing dense variables, a slice of binned data “drops” all unaligned data falling into areas outside the slice:

[12]:
s0 = binned['x', 0]
s1 = binned['x', 1]
print(f'total events: {sc.sum(binned.bins.sum().data)}')
print(f'events x=0:   {sc.sum(s0.bins.sum().data)}')
print(f'events x=1:   {sc.sum(s1.bins.sum().data)}')
total events: <scipp.Variable> ()    float64         [counts]  [142.765010]  [142.765010]
events x=0:   <scipp.Variable> ()    float64         [counts]  [69.101617]  [69.101617]
events x=1:   <scipp.Variable> ()    float64         [counts]  [73.663394]  [73.663394]

This can provide an intuitive way of “filtering” lists of data based on some property of the list items.
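The filtering effect can be imitated with a few lines of NumPy. The five events below are invented for this sketch; one of them lies outside the bin edges and therefore contributes to neither slice.

```python
import numpy as np

# Hypothetical events; the one at x=0.95 lies outside the edges.
x = np.array([0.2, 0.7, 0.4, 0.95, 0.6])
w = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_edges = np.array([0.1, 0.5, 0.9])

# "Slicing" along x selects the events whose bin index matches the slice.
bin_index = np.digitize(x, x_edges) - 1
slices = [w[bin_index == i] for i in range(len(x_edges) - 1)]

total = sum(s.sum() for s in slices)
print(slices[0].sum(), slices[1].sum(), total)
```

The two slice sums add up to the total over all bins, in the same way as the sums for x=0 and x=1 in the output above.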

Masking

Masks can be defined for the unaligned data array as well as for the binned wrapper. This gives fine-grained and intuitive control, e.g., for masking invalid list entries on the one hand and excluding regions in space on the other hand, without the need to manually determine which list entries fall into the exclusion zone.

We define two masks: one in the X-Y plane and one for positions. The position mask is added to the masks dict of the bins property:

[13]:
x_y_mask = sc.array(
    dims=binned.dims,
    values=np.array([[True, False], [True, False], [False, False], [False, False]])
)
binned.masks['exclude'] = x_y_mask
binned.bins.masks['broken_sensor'] = binned.bins.coords['y'] > 0.6 * sc.Unit('m')

As usual, more masks can be added if required, and masks can be removed as long as no reduction operation such as summing or histogramming took place.

We can then plot the result. The mask of the underlying unaligned data is applied during the histogram step, i.e., masked positions are excluded. The mask of the binned wrapper is indicated in the plot and carried through the histogram step. Make sure to compare this figure with the one we obtained earlier, before masking, and note how the values of the un-masked X-Y bins have changed due to masked positions of the underlying unaligned data:

[14]:
sc.plot(binned, resolution=10)
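The interplay of the two mask levels can be sketched in NumPy. This is a toy model with two bins and made-up values; it uses `numpy.ma` to stand in for a bin-level mask and is not how Scipp represents masks.

```python
import numpy as np

# Two bins with hypothetical event lists.
bin_events = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]

# Event-level masks: masked events are excluded when summing a bin.
event_masks = [np.array([False, True, False]), np.array([False, False])]

# Bin-level mask: carried along with the summed result.
bin_mask = np.array([False, True])

sums = np.array([ev[~m].sum() for ev, m in zip(bin_events, event_masks)])
masked_sums = np.ma.masked_array(sums, mask=bin_mask)

print(sums)               # per-bin sums after dropping masked events
print(masked_sums.sum())  # total over bins that are not masked
```

The event-level mask changes the value of the first bin, while the bin-level mask leaves the value of the second bin intact but excludes it from further reductions.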

Arithmetic operations

A number of arithmetic operations and other operations for binned data arrays are supported.

Manipulating bin-based and event-based metadata

Convert bin-based coordinate into event-based coordinate

Consider binned data as above, but with a coordinate that has no corresponding event-coord. This could be the case, e.g., with a time dimension that corresponds to groups rather than bins:

[15]:
da = binned.copy()
del da.coords['y']
del da.bins.coords['y']
da = da.rename_dims({'y':'time'})
da.coords['time'] = sc.array(dims=['time'], unit='s', values=np.arange(4))
da
[15]:
scipp.DataArray (1.69 KB)
    • time: 4
    • x: 2
    • time
      (time)
      int64
      s
      0, 1, 2, 3
      Values:
      array([0, 1, 2, 3])
    • x
      (x [bin-edge])
      float64
      m
      0.1, 0.5, 0.9
      Values:
      array([0.1, 0.5, 0.9])
    • (time, x)
      DataArrayView
      binned data [len=3, len=6, ..., len=5, len=5]
      Values:
      [<scipp.DataArray> Dimensions: Sizes[position:3, ] Coordinates: position string [dimensionless] (position) ["site-10", "site-11", "site-45"] x float64 [m] (position) [0.102334, 0.414056, 0.237027] Data: float64 [counts] (position) [4.191945, 6.852195, 4.478935] [4.191945, 6.852195, 4.478935] Masks: broken_sensor bool [dimensionless] (position) [False, False, False] , <scipp.DataArray> Dimensions: Sizes[position:6, ] Coordinates: position string [dimensionless] (position) ["site-19", "site-23", "site-28", "site-29", "site-35", "site-43"] x float64 [m] (position) [0.586555, 0.807391, 0.750812, 0.725998, 0.895886, 0.578390] Data: float64 [counts] (position) [1.981015, 6.923226, 1.698304, 8.781425, 3.155156, 7.892793] [1.981015, 6.923226, 1.698304, 8.781425, 3.155156, 7.892793] Masks: broken_sensor bool [dimensionless] (position) [False, False, False, False, False, False] , <scipp.DataArray> Dimensions: Sizes[position:1, ] Coordinates: position string [dimensionless] (position) ["site-3"] x float64 [m] (position) [0.265547] Data: float64 [counts] (position) [3.023326] [3.023326] Masks: broken_sensor bool [dimensionless] (position) [False] , <scipp.DataArray> Dimensions: Sizes[position:0, ] Coordinates: position string [dimensionless] (position) [] x float64 [m] (position) [] Data: float64 [counts] (position) [] [] Masks: broken_sensor bool [dimensionless] (position) [] , <scipp.DataArray> Dimensions: Sizes[position:3, ] Coordinates: position string [dimensionless] (position) ["site-13", "site-25", "site-40"] x float64 [m] (position) [0.414179, 0.165354, 0.114746] Data: float64 [counts] (position) [8.781174, 8.946067, 9.888611] [8.781174, 8.946067, 9.888611] Masks: broken_sensor bool [dimensionless] (position) [True, False, False] , <scipp.DataArray> Dimensions: Sizes[position:4, ] Coordinates: position string [dimensionless] (position) ["site-1", "site-8", "site-30", "site-32"] x float64 [m] (position) [0.678836, 0.589306, 0.883306, 0.750942] Data: float64 
[counts] (position) [7.203245, 3.967675, 0.983468, 9.578895] [7.203245, 3.967675, 0.983468, 9.578895] Masks: broken_sensor bool [dimensionless] (position) [False, True, False, False] , <scipp.DataArray> Dimensions: Sizes[position:5, ] Coordinates: position string [dimensionless] (position) ["site-2", "site-24", "site-27", "site-34", "site-36"] x float64 [m] (position) [0.211628, 0.397677, 0.347766, 0.269928, 0.428091] Data: float64 [counts] (position) [0.001144, 8.763892, 0.390548, 6.918771, 6.865009] [0.001144, 8.763892, 0.390548, 6.918771, 6.865009] Masks: broken_sensor bool [dimensionless] (position) [True, True, True, True, True] , <scipp.DataArray> Dimensions: Sizes[position:5, ] Coordinates: position string [dimensionless] (position) ["site-15", "site-16", "site-38", "site-39", "site-47"] x float64 [m] (position) [0.535896, 0.663795, 0.663441, 0.621696, 0.573679] Data: float64 [counts] (position) [6.704675, 4.173048, 0.182883, 7.501443, 2.936141] [6.704675, 4.173048, 0.182883, 7.501443, 2.936141] Masks: broken_sensor bool [dimensionless] (position) [True, True, True, True, True] ]
    • exclude
      (time, x)
      bool
      True, False, ..., False, False
      Values:
      array([[ True, False], [ True, False], [False, False], [False, False]])

If we, e.g., intend to erase the grouping dimension, we may nevertheless want to preserve the time information. This can be achieved using scipp.bins_like to broadcast, for each bin, a single value to a “list” of the required size:

[16]:
da.bins.coords['time'] = sc.bins_like(da, da.coords['time'])
sc.table(da.values[0])
sc.table(da.values[2])
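What this broadcast amounts to can be illustrated with NumPy. The bin sizes below are hypothetical and only the idea is shown: each bin's single time value is repeated once per event in that bin.

```python
import numpy as np

# Hypothetical number of events per bin, with dimensions (time, x).
sizes = np.array([[3, 6], [1, 0]])
time = np.array([0, 1])  # one time value per row of bins

# Broadcast the time value to every bin, then repeat it once per event.
time_per_bin = np.broadcast_to(time[:, None], sizes.shape)
event_time = np.repeat(time_per_bin.ravel(), sizes.ravel())
print(event_time)
```

An empty bin contributes no entries, so the resulting event coordinate has exactly as many elements as there are events.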

Plotting higher dimensions

On-the-fly histogramming is also supported for plotting binned data with more than 2 dimensions:

[17]:
N = 5000
values = 10*np.random.rand(N)
data3d = sc.DataArray(
    data=sc.Variable(dims=['position'], unit=sc.units.counts, values=values, variances=values),
    coords={
        'position':sc.Variable(dims=['position'], values=['site-{}'.format(i) for i in range(N)]),
        'x':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N)),
        'y':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N)),
        'z':sc.Variable(dims=['position'], unit=sc.units.m, values=np.random.rand(N))})
zbins = sc.Variable(dims=['z'], unit=sc.units.m, values=np.linspace(0.1, 0.9, 20))
binned = sc.bin(data3d, edges=[zbins, ybins, xbins])
sc.plot(binned, resolution=10)
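For comparison, the dense 3-D histogram that such a plot is based on can be approximated with `numpy.histogramdd`. The random data here are independent of the example above, and the sketch says nothing about how the plot function resamples internally.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
pos = rng.random((n, 3))  # columns: z, y, x
w = 10 * rng.random(n)    # hypothetical counts per point

z_edges = np.linspace(0.1, 0.9, 20)  # 20 edges give 19 bins
y_edges = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
x_edges = np.array([0.1, 0.5, 0.9])

# Weighted histogram over all three dimensions at once.
hist, _ = np.histogramdd(pos, bins=[z_edges, y_edges, x_edges], weights=w)
print(hist.shape)
```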