GroupBy

“Group by” refers to an implementation of the “split-apply-combine” approach known from pandas and xarray. Scipp currently supports only a limited set of operations that can be applied to the groups.

Grouping based on label values

Suppose we have measured data for a number of parameter values, potentially repeating measurements with the same parameter multiple times:

[1]:
import numpy as np
import scipp as sc

np.random.seed(0)
[2]:
param = sc.Variable(dims=['x'], values=[1,3,1,1,5,3])
values = sc.Variable(dims=['x', 'y'], values=np.random.rand(6,16))
values += 1.0 + param

If we store this data as a data array we obtain the following plot:

[3]:
data = sc.DataArray(
    values,
    coords={
        'x': sc.Variable(dims=['x'], values=np.arange(6)),
        'y': sc.Variable(dims=['y'], values=np.arange(16))
    })
sc.plot(data)

Note that we chose the “measured” values such that the three distinct values of the underlying parameter are visible. We can now use the split-apply-combine mechanism to transform our data into a more useful representation. We start by storing the parameter values (or any value to be used for grouping) as a non-dimension coordinate:

[4]:
data.coords['param'] = param

Next, we call scipp.groupby to split the data and call mean on each of the groups:

[5]:
grouped = sc.groupby(data, group='param').mean('x')
sc.plot(grouped)

Apart from mean, groupby also supports sum, concat, and more. See GroupByDataArray and GroupByDataset for a full list.
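
To make the split-apply-combine steps concrete, the following is a minimal plain-Python sketch of what a grouped mean does conceptually. It is not scipp's implementation; the helper name group_mean and the two-column toy data are hypothetical and chosen only for illustration, and the groups are sorted by label here purely for readability.

```python
from collections import defaultdict


def group_mean(labels, rows):
    """Hypothetical helper: mean of `rows` per distinct label, column by column."""
    groups = defaultdict(list)
    for label, row in zip(labels, rows):
        groups[label].append(row)  # split: collect rows sharing a label
    result = {}
    for label in sorted(groups):
        members = groups[label]
        # apply: average over the grouped dimension, one column at a time
        result[label] = [sum(col) / len(members) for col in zip(*members)]
    return result  # combine: one output row per distinct label


param = [1, 3, 1, 1, 5, 3]  # same labels as in the example above
rows = [[float(p + 1), float(p + 2)] for p in param]  # toy data with 2 columns
means = group_mean(param, rows)
```

The six input rows collapse into three output rows, one per distinct parameter value, which mirrors how the 'x' dimension of the data array is replaced by a 'param' dimension.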

Grouping based on binned label values

Grouping based on non-dimension coordinate values (also known as labels) is most useful when labels are strings or integers. If labels are floating-point values or cover a wide range, it is more convenient to group values into bins, i.e., all values within certain bounds are mapped into the same group. We modify the above example to use a continuously-valued parameter:

[6]:
param = sc.Variable(dims=['x'], values=np.random.rand(16))
values = sc.Variable(dims=['x', 'y'], values=np.random.rand(16,16))
values += 1.0 + 5.0*param
[7]:
data = sc.DataArray(
    values,
    coords={
        'x': sc.Variable(dims=['x'], values=np.arange(16)),
        'y': sc.Variable(dims=['y'], values=np.arange(16))
    })
sc.plot(data)

We create a variable defining the desired binning:

[8]:
bins = sc.Variable(dims=["z"], values=np.linspace(0.0, 1.0, 10))

As before, we can now use groupby and mean to transform the data:

[9]:
data.coords['param'] = param
grouped = sc.groupby(data, group='param', bins=bins).mean('x')
sc.plot(grouped)

The values in the white rows are NaN. This is the result of empty bins, which do not have a meaningful mean value.
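
The origin of the NaN rows can be illustrated with a small plain-Python sketch of a binned mean. This is not how scipp implements binning; the helper name binned_mean is hypothetical, and the sketch assumes half-open bins [left, right) and silently drops values that fall outside the bin edges.

```python
import bisect
import math


def binned_mean(params, values, edges):
    """Hypothetical helper: mean of `values` per bin of `params`."""
    n_bins = len(edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, v in zip(params, values):
        i = bisect.bisect_right(edges, p) - 1  # index of the bin with left edge <= p
        if 0 <= i < n_bins:  # assumption: values outside the edges are ignored
            sums[i] += v
            counts[i] += 1
    # An empty bin has no elements to average, so its mean is undefined (NaN).
    return [s / c if c else math.nan for s, c in zip(sums, counts)]


edges = [0.0, 0.25, 0.5, 0.75, 1.0]  # 5 edges define 4 bins
params = [0.1, 0.2, 0.6, 0.9]
values = [1.0, 3.0, 5.0, 7.0]
means = binned_mean(params, values, edges)
```

No parameter value falls into the second bin, [0.25, 0.5), so its entry in means is NaN, just like the white rows in the plot.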

Alternatively, the group can be passed as a Variable rather than as the name of a coordinate. This, however, requires bins to be specified, since bins define the new dimension label.

[10]:
grouped = sc.groupby(data, group=param, bins=bins).mean('x') # note the lack of quotes around param!
sc.plot(grouped)

Usage examples

Filtering a variable using groupby.copy

Apart from the reduction operations discussed above, groupby also supports copy, which extracts a group without changes. We can use this, e.g., to filter a variable:

[11]:
var = sc.array(dims=['x'], values=np.random.rand(100))
select = var < 0.5 * sc.Unit('')

We proceed as follows:

  1. Create a helper data array with a dummy coord that will be used to group the data elements.

  2. Call groupby, grouping by the dummy coord. Here select contains two distinct values, False and True, so groupby returns an object with two groups.

  3. Pass 1 to copy to extract the second group (group indices start at 0), which contains all elements where the dummy coord value is True.

  4. Finally, the data property returns only the filtered variable without the temporary coords that were required for groupby.

[12]:
helper = sc.DataArray(var, coords={'dummy':select})
grouped = sc.groupby(helper, group='dummy')
filtered_var = grouped.copy(1).data
filtered_var
[12]:
scipp.Variable (448 Bytes)
    • (x: 56)
      float64
      0.328, 0.240, ..., 0.449, 0.304
      Values:
      array([0.3277204 , 0.24002027, 0.16053882, 0.45813883, 0.45722345, 0.15941446, 0.39843426, 0.06271295, 0.42403225, 0.25868407, 0.03330463, 0.35536885, 0.35670689, 0.0163285 , 0.18523233, 0.4012595 , 0.09961493, 0.4541624 , 0.32670088, 0.23274413, 0.03307459, 0.01560606, 0.42879572, 0.06807407, 0.25194099, 0.22116092, 0.25319119, 0.13105523, 0.01203622, 0.1154843 , 0.4090541 , 0.16295443, 0.49030535, 0.06530421, 0.2883985 , 0.24141862, 0.24606318, 0.42408899, 0.28705152, 0.41485687, 0.36054556, 0.04600731, 0.23262699, 0.34851937, 0.29655627, 0.24942004, 0.10590615, 0.23342026, 0.05835636, 0.2724369 , 0.3790569 , 0.37429618, 0.23780724, 0.1718531 , 0.44929165, 0.30446841])

Note that we can also avoid the named helpers helper and grouped and write:

[13]:
filtered_var = sc.groupby(sc.DataArray(var, coords={'dummy':select}), group='dummy').copy(1).data
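
Why index 1 selects the elements where the condition holds can be seen in a plain-Python sketch. The helper name copy_group is hypothetical, and the sketch assumes that groups are ordered by sorted label value, so that False comes before True; it does not reproduce scipp's implementation.

```python
def copy_group(values, labels, index):
    """Hypothetical helper: return the elements of the group at position `index`."""
    keys = sorted(set(labels))  # distinct labels; False sorts before True
    wanted = keys[index]
    return [v for v, label in zip(values, labels) if label == wanted]


var = [0.7, 0.2, 0.9, 0.4]
select = [v < 0.5 for v in var]  # [False, True, False, True]
filtered = copy_group(var, select, 1)  # group 1 corresponds to True
```

In this sketch, if select held only a single distinct value there would be just one group, and index 1 would not exist. The positional index is therefore only a safe way to pick the True group when both values occur.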