Computation with Binned Data#

As described in Binned Data Scipp can handle certain types of sparse or scattered data such as event data, i.e., data that cannot directly be represented as a multi-dimensional array. This could, e.g., be used to store data from an array of sensors/detectors that are read out independently, with potentially widely varying frequency.

Scipp supports two classes of operations with binned data.

  1. Bin-centric arithmetic treats every bin as an element to, e.g., apply a different scale factor to every bin.

  2. Event-centric arithmetic considers the individual events within bins. This allows for operation without the precision loss that would ensue from simply histogramming data.

Overview and Quick Reference#

Before going into a detailed explanation below we provide a quick reference:

  • Unary operations such as sin, cos, or sqrt work as normal.

  • Comparison operations such as less (<) are not supported.

  • Binary operations such as + work in principle, but usually not if both operands represent event data. In that case, see table below.

Given two data arrays a and b, equivalent operations are:

Dense data operation

Binned data equivalent

Comment

a + b

a.bins.concatenate(b)

if both a and b are event data

a - b

a.bins.concatenate(-b)

if both a and b are event data

a += b

a.bins.concatenate(b, out=a)

if both a and b are event data

a -= b

a.bins.concatenate(-b, out=a)

if both a and b are event data

sc.sum(a, 'dim')

a.bins.concat('dim')

sc.mean(a, 'dim')

not available (but see comment below)

min, max, and other similar reductions are also not available

sc.rebin(a, {dim:edges})

sc.bin(a, {dim:edges})

a.groupby('param').sum('dim')

a.group('param')

`` a.groupby(‘param’, bins=edges).sum(‘dim’)``

a.bin(param=edges)

While above we list mean, min, max, and others as “not available”, note that this refers to an equivalent function preserving data as binned data. sum, mean, min, and max all do work on binned data, but return dense data.

Concepts#

Before assigning events to bins, we can initialize them as a single long list or table. In the simplest case this table has just a single column, i.e., it is a Scipp variable:

[1]:
import numpy as np
import scipp as sc

table = sc.array(
    dims=['event'], values=[0, 1, 3, 1, 1, 1, 42, 1, 1, 1, 1, 1], dtype='float64'
)
sc.table(table)
[1]:
Data
[𝟙]
0.000
1.000
3.000
1.000
1.000
1.000
42.000
1.000
1.000
1.000
1.000
1.000

The events in the table can then be mapped into bins:

[2]:
begin = sc.array(dims=['x'], values=[0, 6, 6, 8], unit=None)
end = sc.array(dims=['x'], values=[6, 6, 8, 12], unit=None)
var = sc.bins(begin=begin, end=end, dim='event', data=table)
sc.show(var)
var
dims=('x',), shape=(4,), unit=None, variances=Falsevalues x dims=('event',), shape=(12,), unit=dimensionless, variances=Falsevalues event
[2]:
Show/Hide data repr Show/Hide attributes
scipp.Variable (888 Bytes)
    • (x: 4)
      VariableView
      binned data [len=6, len=0, len=2, len=4]
      dim='event',
      content=Variable(dims=(event: 12), dtype=float64, unit=dimensionless)

Each element of the resulting “bin variable” references a section of the underlying table:

[3]:
sc.table(var['x', 0].value)
[3]:
Data
[𝟙]
0.000
1.000
3.000
1.000
1.000
1.000
[4]:
sc.table(var['x', 1].value)
[4]:
Data
[𝟙]
[5]:
sc.table(var['x', 2].value)
[5]:
Data
[𝟙]
42.000
1.000

Bin-centric arithmetic#

Elements of binned variables are views of slices of a variable or data array. An operation such as multiplication of a binned variable with a dense array thus computes the product of the bin (a variable view or data array view) with a scalar element of the dense array. In other words, operations between variables or data arrays broadcast dense data to the lists held in the bins:

[6]:
scale = sc.arange('x', 2.0, 6)
var *= scale
var['x', 0].values
[6]:
Show/Hide data repr Show/Hide attributes
scipp.Variable (304 Bytes out of 352 Bytes)
    • (event: 6)
      float64
      𝟙
      0.0, 2.0, ..., 2.0, 2.0
      Values:
      array([0., 2., 6., 2., 2., 2.])
[7]:
var['x', 1].values
[7]:
Show/Hide data repr Show/Hide attributes
scipp.Variable (256 Bytes out of 352 Bytes)
    • (event: 0)
      float64
      𝟙
      Values:
      array([], dtype=float64)
[8]:
var['x', 2].values
[8]:
Show/Hide data repr Show/Hide attributes
scipp.Variable (272 Bytes out of 352 Bytes)
    • (event: 2)
      float64
      𝟙
      168.0, 4.0
      Values:
      array([168., 4.])

In practice scattered data requires more than one “column” of information. Typically we need at least one coordinate such as an event time stamp in addition to weights. If each scattered data point (event) corresponds to, e.g., a single detected neutron then the weight is 1. As above, we start by creating a single table containing all events:

[9]:
times = sc.array(
    dims=['event'],
    unit='us',  # micro second
    values=[0, 1, 3, 1, 1, 1, 4, 1, 1, 2, 1, 1],
    dtype='float64',
)
weights = sc.ones(dims=['event'], unit='counts', shape=[12], with_variances=True)

table = sc.DataArray(data=weights, coords={'time': times})
sc.table(table)
table
[9]:
Show/Hide data repr Show/Hide attributes
scipp.DataArray (1.28 KB)
    • event: 12
    • time
      (event)
      float64
      µs
      0.0, 1.0, ..., 1.0, 1.0
      Values:
      array([0., 1., 3., 1., 1., 1., 4., 1., 1., 2., 1., 1.])
    • (event)
      float64
      counts
      1.0, 1.0, ..., 1.0, 1.0
      σ = 1.0, 1.0, ..., 1.0, 1.0
      Values:
      array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

      Variances (σ²):
      array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

This table is then mapped into bins. The resulting “bin variable” can, e.g., be used as the data in a data array, and can be combined with coordinates as usual:

[10]:
var = sc.bins(begin=begin, end=end, dim='event', data=table)
a = sc.DataArray(data=var, coords={'x': sc.Variable(dims=['x'], values=np.arange(4.0))})
a
[10]:
Show/Hide data repr Show/Hide attributes
scipp.DataArray (2.49 KB)
    • x: 4
    • x
      (x)
      float64
      𝟙
      0.0, 1.0, 2.0, 3.0
      Values:
      array([0., 1., 2., 3.])
    • (x)
      DataArrayView
      binned data [len=6, len=0, len=2, len=4]
      dim='event',
      content=DataArray(
                dims=(event: 12),
                data=float64[counts],
                coords={'time':float64[µs]})

In practice we rarely use sc.bins (which requires begin and end indices for each bin and an appropriately sorted table) to create binned data. For creating binned variables sc.bins is the only option, but for binning data arrays we typically bin based on the meta data of the individual data points, i.e., the coord columns in the table of scattered data. Use scipp.bin (not sc.bins) and/or scipp.group as described in Binned Data.

In the graphical representation of the data array we can see the dense coordinate (green), and the bins (yellow):

[11]:
sc.show(a)
(dims=('x',), shape=(4,), unit=None, variances=False)values x (dims=('event',), shape=(12,), unit=counts, variances=True)variances eventvalues event timetime(dims=('event',), shape=(12,), unit=µs, variances=False)values event xx(dims=('x',), shape=(4,), unit=dimensionless, variances=False)values x

As before, each bin references a section of the underlying table:

[12]:
sc.table(a['x', 0].value)
[12]:
CoordinatesData
time [µs] [counts]
0.0001.000±1.000
1.0001.000±1.000
3.0001.000±1.000
1.0001.000±1.000
1.0001.000±1.000
1.0001.000±1.000
[13]:
sc.table(a['x', 1].value)
[13]:
CoordinatesData
time [µs] [counts]
[14]:
sc.table(a['x', 2].value)
[14]:
CoordinatesData
time [µs] [counts]
4.0001.000±1.000
1.0001.000±1.000

Event-centric arithmetic#

Complementary to the bin-centric arithmetic operations, which treat every bin as an element, we may need more fine-grained operations that treat events within bins individually. Scipp provides a number of such operations, defined in a manner to produce a result equivalent to that of a corresponding operation for histogrammed data, but preserving events and thus keeping the options of, e.g., binning to a higher resolution or filtering by meta data later. For example, addition of histogrammed data would correspond to concatenating event lists of individual bins.

The following operations are supported:

  • “Addition” of data arrays containing event data in bins. This is achieved by concatenating the underlying event lists.

  • “Subtraction” of data arrays containing event data in bins. This is performed by concatenating with a negative weight for the subtrahend.

  • “Multiplication” of a data array containing event data in bins with a data array with dense, histogrammed data. The weight of each event is scaled by the value of the corresponding bin in the histogram. Note that in contrast to the bin-centric operations the histogram may have a different (higher) resolution than the bin size of the binned event data. This operation does not match based on the coords of the bin but instead considers the more precise coord values of the individual events, i.e., the operation is performances based on the meta data column in the underlying table.

  • “Division” of a data array containing event data in bins by a data array with dense, histogrammed data. This is performed by scaling with the inverse of the denominator.

Note:

These operations, in particular multiplication and division, are only interchangeable with histogramming if the variances of the “histogram” operand are negligible. If these variances are not negligible the operation on the event data would introduce correlations in the error bars of the individual events. Scipp has no way of tracking such correlations and therefore these operations raise VariancesError to avoid silent underestimation of uncertainties. See our publication Systematic underestimation of uncertainties by widespread neutron-scattering data-reduction software for more background.

Addition#

[15]:
sc.show(a['x', 2].value)
a.bins.concatenate(a, out=a)
sc.show(a['x', 2].value)
a.bins.sum().plot()
(dims=('event',), shape=(2,), unit=counts, variances=True)variances eventvalues event timetime(dims=('event',), shape=(2,), unit=µs, variances=False)values event
(dims=('event',), shape=(4,), unit=counts, variances=True)variances eventvalues event timetime(dims=('event',), shape=(4,), unit=µs, variances=False)values event
[15]:
../../_images/user-guide_binned-data_computation_28_2.svg

Subtraction#

[16]:
zero = a.copy()
sc.show(zero['x', 2].value)
zero.bins.concatenate(-zero, out=zero)
sc.show(zero['x', 2].value)
zero.bins.sum().plot()
(dims=('event',), shape=(4,), unit=counts, variances=True)variances eventvalues event timetime(dims=('event',), shape=(4,), unit=µs, variances=False)values event
(dims=('event',), shape=(8,), unit=counts, variances=True)variances eventvalues event timetime(dims=('event',), shape=(8,), unit=µs, variances=False)values event
[16]:
../../_images/user-guide_binned-data_computation_30_2.svg

Multiplication and division#

We prepare a histogram that will be used to scale the event weights:

[17]:
time_bins = sc.array(dims=['time'], unit=sc.units.us, values=[0.0, 3.0, 6.0])
weight = sc.array(dims=['time'], values=[10.0, 3.0])
hist = sc.DataArray(data=weight, coords={'time': time_bins})
hist
[17]:
Show/Hide data repr Show/Hide attributes
scipp.DataArray (1.04 KB)
    • time: 2
    • time
      (time [bin-edge])
      float64
      µs
      0.0, 3.0, 6.0
      Values:
      array([0., 3., 6.])
    • (time)
      float64
      𝟙
      10.0, 3.0
      Values:
      array([10., 3.])

The helper sc.lookup provides a wrapper for a discrete “function”, given by a data array depending on a specified dimension. The binary arithmetic operators on the bins property support this function-like object:

[18]:
a.bins /= sc.lookup(func=hist, dim='time')
a.hist(time=time_bins).plot()
[18]:
../../_images/user-guide_binned-data_computation_34_0.svg

As noted above, operations that would introduce correlations due to a broadcast raise an exception. In this case, using a histogram bin value with variances for multiple events in that bin triggers such an error:

[19]:
weight = sc.array(dims=['time'], values=[10.0, 3.0], variances=[10.0, 3.0])
hist = sc.DataArray(data=weight, coords={'time': time_bins})
a.bins /= sc.lookup(func=hist, dim='time')
---------------------------------------------------------------------------
VariancesError                            Traceback (most recent call last)
Cell In[19], line 3
      1 weight = sc.array(dims=['time'], values=[10.0, 3.0], variances=[10.0, 3.0])
      2 hist = sc.DataArray(data=weight, coords={'time': time_bins})
----> 3 a.bins /= sc.lookup(func=hist, dim='time')

File ~/work/scipp/scipp/.tox/docs/lib/python3.10/site-packages/scipp/core/bins.py:206, in Bins.__itruediv__(self, lut)
    205 def __itruediv__(self, lut: Lookup) -> Bins[_O]:  # noqa: PYI034
--> 206     _cpp.buckets.scale(self._obj, _cpp.reciprocal(lut.func), lut.dim)
    207     return self

VariancesError: Cannot broadcast object with variances as this would introduce unhandled correlations. Input dimensions were:
(x: 4) variances=False
(x: 4) variances=False
() variances=False
() variances=True

See https://doi.org/10.3233/JNR-220049 for more background.