# Histogramming, grouping, and binning

## Overview

Histogramming (see sc.hist), grouping (using sc.groupby), and binning (see Binned data) serve similar but distinct purposes. Picking the most suitable of the three for a particular application may yield more natural code and better performance. Let us start with an example. Consider a table of scattered measurements:

[1]:

import numpy as np
import scipp as sc

N = 5000
values = 10 * np.random.rand(N)
table = sc.DataArray(
    data=sc.array(
        dims=['position'], unit=sc.units.counts, values=values, variances=values
    ),
    coords={
        'x': sc.array(dims=['position'], unit='m', values=np.random.rand(N)),
        'y': sc.array(dims=['position'], unit='m', values=np.random.rand(N)),
    },
)
table.values *= 1.0 / np.exp(5.0 * table.coords['x'].values)
sc.table(table['position', :5])

[1]:

| x [m] (coordinate) | y [m] (coordinate) | data [counts] |
|---|---|---|
| 0.476 | 0.852 | 0.547±2.430 |
| 0.746 | 0.129 | 0.119±2.227 |
| 0.366 | 0.636 | 0.701±2.091 |
| 0.882 | 0.230 | 0.053±2.087 |
| 0.699 | 0.017 | 0.282±3.047 |

We may now be interested in the total intensity (counts) as a function of 'x'. There are three ways to do this:

[2]:

xbins = sc.linspace('x', 0, 1, num=40, unit='m')
ds = sc.Dataset(
    {
        'histogram': table.hist(x=xbins),
        'groupby': table.groupby('x', bins=xbins).sum('position'),
        'bin': table.bin(x=xbins).bins.sum(),
    }
)
ds.plot()


[2]:


The plot above shows only a single line because the three solutions yield the same result (up to floating-point rounding errors):

• hist sorts data points into ‘x’ bins, summing immediately.

• groupby groups by ‘x’ and then sums (on-the-fly) all data points falling in the same ‘x’ bin.

• bin sorts data points into ‘x’ bins. Summing all rows in a bin yields the same result as grouping and summing directly.

So in this case we get equivalent results, but the application areas differ, as described in more detail in the following sections.
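To make the distinction concrete, the following is a minimal pure-Python sketch of the three strategies on a tiny made-up table. It is a conceptual illustration only and does not show how scipp implements these operations; the helper names (`bin_index`, `hist_sum`, `groupby_sum`, `bin_rows`) and the sample rows are hypothetical.

```python
from bisect import bisect_right

# Hypothetical sample rows: (x, y, counts). Not taken from the table above.
rows = [(0.10, 0.7, 2.0), (0.80, 0.2, 1.0), (0.15, 0.9, 3.0), (0.55, 0.4, 4.0)]
edges = [0.0, 0.5, 1.0]  # two bins: [0, 0.5) and [0.5, 1.0)


def bin_index(x):
    """Index of the bin containing x, assuming edges[0] <= x < edges[-1]."""
    return bisect_right(edges, x) - 1


def hist_sum(rows):
    """Histogram: accumulate into the output immediately; rows are not kept."""
    out = [0.0] * (len(edges) - 1)
    for x, _, counts in rows:
        out[bin_index(x)] += counts
    return out


def groupby_sum(rows):
    """Group-by: collect the values of each group, then reduce each group."""
    groups = {i: [] for i in range(len(edges) - 1)}
    for x, _, counts in rows:
        groups[bin_index(x)].append(counts)
    return [sum(groups[i]) for i in sorted(groups)]


def bin_rows(rows):
    """Binning: reorder whole rows so that each bin's rows are kept together."""
    binned = [[] for _ in range(len(edges) - 1)]
    for row in rows:
        binned[bin_index(row[0])].append(row)
    return binned


binned = bin_rows(rows)
# Summing the rows of each bin afterwards reproduces the histogram.
bin_then_sum = [sum(counts for _, _, counts in b) for b in binned]
```

All three approaches give the same per-bin sums here, but only `bin_rows` still has the `y` value of every row available afterwards.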

## Histogramming

scipp.hist directly sums the data and is efficient. Limitations are:

• When histogramming in more than one dimension, the implementation uses sc.bin internally, which may be less efficient and uses more memory.

• Can only apply “sum” or “nansum” to accumulate into a bin. scipp.nanhist is currently implemented differently and uses sc.bin internally. It therefore uses more memory and may be less efficient.
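The difference between accumulating with "sum" and with "nansum" can be illustrated by a small pure-Python sketch. This is a conceptual illustration with assumed sample data, not scipp's implementation; `accumulate` and its `skip_nan` flag are hypothetical names.

```python
import math
from bisect import bisect_right

edges = [0.0, 0.5, 1.0]  # two bins: [0, 0.5) and [0.5, 1.0)
# Hypothetical (x, counts) rows; one measurement is NaN.
rows = [(0.1, 2.0), (0.2, float("nan")), (0.7, 1.5), (0.9, 0.5)]


def accumulate(rows, skip_nan):
    """Sum counts per x bin; with skip_nan=True, NaN values are ignored."""
    out = [0.0] * (len(edges) - 1)
    for x, counts in rows:
        if skip_nan and math.isnan(counts):
            continue
        out[bisect_right(edges, x) - 1] += counts
    return out


plain = accumulate(rows, skip_nan=False)    # a single NaN poisons its bin
nan_safe = accumulate(rows, skip_nan=True)  # the NaN row is skipped
```

With plain summation the first bin becomes NaN, whereas the NaN-skipping variant keeps the valid contribution in that bin.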

We can also histogram binned data (since binning preserves the 'y' coord), to create 2-D (or N-D) histograms:

[3]:

binned = table.bin(x=xbins)
hist = binned.hist(y=30)
hist.plot()

[3]:

[4]:

hist

[4]:

scipp.DataArray (20.09 KB)
• x: 39
• y: 30
• x
(x [bin-edge])
float64
m
0.0, 0.026, ..., 0.974, 1.0
Values:array([0.        , 0.02564103, 0.05128205, 0.07692308, 0.1025641 ,
0.12820513, 0.15384615, 0.17948718, 0.20512821, 0.23076923,
0.25641026, 0.28205128, 0.30769231, 0.33333333, 0.35897436,
0.38461538, 0.41025641, 0.43589744, 0.46153846, 0.48717949,
0.51282051, 0.53846154, 0.56410256, 0.58974359, 0.61538462,
0.64102564, 0.66666667, 0.69230769, 0.71794872, 0.74358974,
0.76923077, 0.79487179, 0.82051282, 0.84615385, 0.87179487,
0.8974359 , 0.92307692, 0.94871795, 0.97435897, 1.        ])
• y
(y [bin-edge])
float64
m
4.809e-05, 0.033, ..., 0.967, 1.000
Values:array([4.80862445e-05, 3.33765299e-02, 6.67049736e-02, 1.00033417e-01,
1.33361861e-01, 1.66690305e-01, 2.00018748e-01, 2.33347192e-01,
2.66675636e-01, 3.00004080e-01, 3.33332523e-01, 3.66660967e-01,
3.99989411e-01, 4.33317854e-01, 4.66646298e-01, 4.99974742e-01,
5.33303185e-01, 5.66631629e-01, 5.99960073e-01, 6.33288517e-01,
6.66616960e-01, 6.99945404e-01, 7.33273848e-01, 7.66602291e-01,
7.99930735e-01, 8.33259179e-01, 8.66587622e-01, 8.99916066e-01,
9.33244510e-01, 9.66572953e-01, 9.99901397e-01])
• (x, y)
float64
counts
7.582, 20.235, ..., 0.190, 0.239
σ = 2.869, 4.620, ..., 5.103, 5.855
Values:array([[7.58172613e+00, 2.02351232e+01, 2.31995873e+01, ...,
1.69939472e+01, 3.57140537e+01, 3.11881213e+01],
[1.73691329e+01, 1.65595530e+01, 3.09718814e+01, ...,
1.84955092e+01, 0.00000000e+00, 3.58537898e+01],
[1.31466400e+00, 1.70584753e+01, 1.03351509e+01, ...,
1.32688376e+01, 0.00000000e+00, 2.77597682e+01],
...,
[6.76079104e-02, 1.98109905e-01, 1.37286272e-01, ...,
1.57239468e-01, 1.05512219e-01, 3.29600961e-01],
[1.57899077e-01, 1.99445696e-02, 2.74342910e-01, ...,
2.70533447e-02, 2.50545343e-01, 2.73122913e-01],
[1.37840595e-01, 1.70302620e-01, 1.16600656e-01, ...,
1.16988525e-01, 1.89712986e-01, 2.39300871e-01]])
Variances (σ²):array([[ 8.23287502, 21.34591716, 24.43987329, ..., 18.49035132,
37.33172109, 33.23438063],
[21.66789111, 20.69524814, 37.50204997, ..., 22.3626058 ,
0.        , 43.4679176 ],
[ 1.89383391, 23.58249894, 14.25318455, ..., 18.07670238,
0.        , 38.87903398],
...,
[ 7.22265387, 21.28761605, 14.75878609, ..., 17.22574248,
11.36929808, 35.48529201],
[19.13998701,  2.35500209, 33.52815236, ...,  3.29719634,
29.86862785, 33.4966994 ],
[19.42936355, 24.10318057, 16.32144078, ..., 16.33609426,
26.04022162, 34.28373928]])

Another capability of hist is to histogram a dimension that has already been binned, using different (for example finer) bin edges. Compare to the plot of the initial example:

[5]:

binned = table.bin(x=xbins)
binned.hist(x=100).plot()

[5]:


## Grouping

groupby is more flexible in terms of the operations that can be applied and may be the go-to solution when a quick one-liner is required. Limitations are:

• Can only group along a single dimension.

• Works best for small to medium-sized data, or if data is already mostly sorted along the grouping dimension. Slow if millions of small input slices contribute to each group.
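The flexibility comes from separating the grouping step from the reduction step. The sketch below shows the idea in pure Python with made-up data; it is not scipp's implementation, `groupby_reduce` is a hypothetical helper, and it assumes every group receives at least one value.

```python
from bisect import bisect_right
from statistics import mean

edges = [0.0, 0.5, 1.0]  # two groups: [0, 0.5) and [0.5, 1.0)
# Hypothetical (x, counts) rows.
rows = [(0.10, 2.0), (0.80, 1.0), (0.15, 3.0), (0.55, 4.0)]


def groupby_reduce(rows, reduce):
    """Collect counts per x group, then apply an arbitrary reduction."""
    groups = [[] for _ in range(len(edges) - 1)]
    for x, counts in rows:
        groups[bisect_right(edges, x) - 1].append(counts)
    # Any function mapping a list of values to one value can be used here.
    return [reduce(g) for g in groups]


sums = groupby_reduce(rows, sum)
means = groupby_reduce(rows, mean)
maxima = groupby_reduce(rows, max)
```

Because the group contents are collected before reducing, swapping the reduction is a one-argument change, at the cost of handling many small slices when the input is unsorted.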

groupby can also operate on binned data, combining bin contents by concatenation:

[6]:

binned = table.bin(x=xbins)
binned.coords['param'] = sc.array(
    dims=['x'], values=(np.random.random(39) * 4).astype(np.int32)
)
grouped = binned.groupby('param').concat('x')
grouped

[6]:

scipp.DataArray (158.70 KB)
• param: 4
• param
(param)
int32
𝟙
0, 1, 2, 3
Values:array([0, 1, 2, 3], dtype=int32)
• (param)
DataArrayView
binned data [len=1116, len=626, len=1675, len=1583]
dim='position',
content=DataArray(
dims=(position: 5000),
data=float64[counts],
coords={'x':float64[m], 'y':float64[m]})

Each output bin is a combination of multiple input bins:

[7]:

grouped.values[0]

[7]:

scipp.DataArray (36.13 KB out of 157.50 KB)
• position: 1116
• x
(position)
float64
m
0.019, 0.016, ..., 0.937, 0.924
Values:array([0.01892755, 0.01568035, 0.01799771, ..., 0.92438293, 0.937194  ,
0.92385361])
• y
(position)
float64
m
0.698, 0.875, ..., 0.904, 0.133
Values:array([0.69823497, 0.87510994, 0.15940353, ..., 0.88123113, 0.90351234,
0.13303128])
• (position)
float64
counts
7.865, 5.942, ..., 0.056, 0.022
σ = 2.940, 2.535, ..., 2.463, 1.490
Values:array([7.86537111, 5.94193833, 3.28414934, ..., 0.07648916, 0.05596949,
0.02189663])
Variances (σ²):array([8.64609241, 6.42654556, 3.59339068, ..., 7.77807152, 6.06795427,
2.22075158])

## Binning

scipp.bin actually reorders data and metadata such that all data contributing to a bin is in a contiguous block. Binning along multiple dimensions is supported. Of the three options it is the only solution that supports modifying data in the grouped/binned layout. A variety of operations on such binned data is available. Limitations are:

• Requires copying and reordering the input data and can thus become expensive.
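The idea of reordering rows into contiguous per-bin blocks can be sketched in pure Python as follows. This is a conceptual illustration with hypothetical sample columns and variable names (`order`, `begin`, `end`), not a description of scipp's internal data layout.

```python
from bisect import bisect_right

edges = [0.0, 0.5, 1.0]  # two bins: [0, 0.5) and [0.5, 1.0)
# Hypothetical columns of an event table (all of the same length).
x = [0.10, 0.80, 0.15, 0.55]
y = [0.7, 0.2, 0.9, 0.4]
counts = [2.0, 1.0, 3.0, 4.0]

# Bin index of every row, then a stable sort of the row indices by bin.
idx = [bisect_right(edges, xi) - 1 for xi in x]
order = sorted(range(len(x)), key=lambda i: idx[i])

# Reorder every column the same way. This copy is what makes binning costly.
x = [x[i] for i in order]
y = [y[i] for i in order]
counts = [counts[i] for i in order]

# Each bin is now a contiguous slice [begin, end) of the reordered columns.
nbins = len(edges) - 1
sizes = [idx.count(b) for b in range(nbins)]
begin = [sum(sizes[:b]) for b in range(nbins)]
end = [begin[b] + sizes[b] for b in range(nbins)]

# Modify the rows of bin 1 in place, e.g. apply a per-bin scale factor.
for i in range(begin[1], end[1]):
    counts[i] *= 10.0
```

Since all columns are reordered together, the `y` values stay attached to their rows, and a bin can be read or modified by slicing with its `begin` and `end` offsets.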

In the above example the 'y' information is dropped by hist and groupby, but bin preserves it:

[8]:

binned = table.bin(x=xbins)
binned.values[0]

[8]:

scipp.DataArray (5.63 KB out of 157.50 KB)
• position: 140
• x
(position)
float64
m
0.019, 0.016, ..., 0.016, 0.017
Values:array([0.01892755, 0.01568035, 0.01799771, 0.0093956 , 0.00331097,
0.02537393, 0.00200557, 0.00685728, 0.01953938, 0.00133971,
0.00406988, 0.01405699, 0.0218467 , 0.0250284 , 0.01114841,
0.01386278, 0.00150371, 0.00217352, 0.00936399, 0.02276897,
0.00947814, 0.02455649, 0.0041042 , 0.00902788, 0.00747038,
0.00278616, 0.00672212, 0.01588494, 0.01409641, 0.0091927 ,
0.02101469, 0.01404844, 0.01460596, 0.01388844, 0.01802982,
0.00037873, 0.02255775, 0.00746013, 0.0050761 , 0.02559401,
0.02019367, 0.01590959, 0.02332724, 0.00958907, 0.02367271,
0.0254768 , 0.00841613, 0.01712171, 0.02493578, 0.01379626,
0.01491825, 0.01673711, 0.00150649, 0.00577279, 0.0067932 ,
0.00103564, 0.00822999, 0.02533333, 0.01306692, 0.00518236,
0.01823981, 0.0113155 , 0.00691065, 0.01428515, 0.00709729,
0.00766012, 0.00598963, 0.01496764, 0.0140292 , 0.02368304,
0.02418214, 0.01292106, 0.00558977, 0.0234116 , 0.01580534,
0.00742722, 0.02457354, 0.0080968 , 0.00136156, 0.01905799,
0.0008584 , 0.02263881, 0.00308303, 0.01197309, 0.01618743,
0.00613042, 0.0002173 , 0.02513708, 0.01067147, 0.02294755,
0.00633796, 0.01870394, 0.01298391, 0.024254  , 0.0051406 ,
0.00059935, 0.00521056, 0.006682  , 0.01028026, 0.00015191,
0.00778345, 0.01742508, 0.01494642, 0.02488724, 0.02093666,
0.01890198, 0.02533543, 0.01647888, 0.00586367, 0.01732017,
0.01553505, 0.01119958, 0.00921314, 0.01209774, 0.01495178,
0.02438064, 0.02302728, 0.02107059, 0.01266957, 0.0102288 ,
0.02480437, 0.0158001 , 0.0009269 , 0.00297359, 0.0229043 ,
0.01920159, 0.00250453, 0.01733948, 0.02002405, 0.02242311,
0.01558565, 0.01142298, 0.00105847, 0.01744881, 0.02201983,
0.01569998, 0.01606719, 0.0149429 , 0.01649062, 0.01710176])
• y
(position)
float64
m
0.698, 0.875, ..., 0.398, 0.371
Values:array([0.69823497, 0.87510994, 0.15940353, 0.93785565, 0.67014892,
0.36640102, 0.56715058, 0.88179892, 0.99224849, 0.10970908,
0.96265957, 0.3727967 , 0.44765894, 0.34494747, 0.38602661,
0.21655311, 0.0967094 , 0.03365053, 0.4387095 , 0.13190338,
0.30137005, 0.72069786, 0.27710994, 0.77133408, 0.04755307,
0.96091109, 0.70543546, 0.07845321, 0.89275829, 0.14271776,
0.75998307, 0.99076898, 0.85302324, 0.36902216, 0.47635831,
0.68488014, 0.49960638, 0.57597779, 0.72481177, 0.42451883,
0.83080745, 0.51640007, 0.74849907, 0.28138088, 0.6826303 ,
0.85636466, 0.7003451 , 0.74128462, 0.07746111, 0.65988267,
0.62695441, 0.8510591 , 0.53905735, 0.63739458, 0.70855963,
0.53487469, 0.89837216, 0.58243315, 0.81274117, 0.2450869 ,
0.66833095, 0.85252724, 0.60718191, 0.41999578, 0.23935953,
0.96227295, 0.49709159, 0.72548661, 0.6518568 , 0.15574014,
0.54197179, 0.11804897, 0.12125647, 0.77160634, 0.60492539,
0.63878006, 0.371926  , 0.59773096, 0.45394831, 0.89930687,
0.15981966, 0.77884395, 0.2194764 , 0.15532049, 0.13847045,
0.47053321, 0.44082455, 0.70025961, 0.77785986, 0.94802717,
0.14006271, 0.14066617, 0.41502572, 0.50245291, 0.71256208,
0.97376637, 0.91809631, 0.55160283, 0.69626688, 0.81499275,
0.78663619, 0.41914464, 0.33036249, 0.09830041, 0.39197549,
0.30043858, 0.82911825, 0.02909849, 0.93904162, 0.25979734,
0.31286872, 0.08045491, 0.800751  , 0.53999104, 0.14014067,
0.03563495, 0.79886727, 0.91306605, 0.24966918, 0.1519607 ,
0.73143975, 0.46486746, 0.07240336, 0.63285418, 0.71205719,
0.67254955, 0.24659366, 0.11237074, 0.97735258, 0.78862888,
0.17835759, 0.4313999 , 0.77770787, 0.91305134, 0.25554684,
0.03890394, 0.71143466, 0.50144737, 0.39824   , 0.37100183])
• (position)
float64
counts
7.865, 5.942, ..., 0.815, 3.750
σ = 2.940, 2.535, ..., 0.941, 2.021
Values:array([7.86537111, 5.94193833, 3.28414934, 2.24870222, 2.26290264,
3.86194657, 6.89121003, 4.5302094 , 7.795778  , 6.32295019,
3.4613793 , 9.26210344, 2.49603271, 6.27252728, 3.88498037,
0.48745221, 5.33330814, 7.99079713, 2.17212408, 1.0547509 ,
8.11731592, 6.71151596, 0.36213656, 8.24050657, 1.39435765,
7.46949032, 2.17328191, 5.33481465, 2.82248661, 1.76702051,
0.98850972, 6.43787156, 1.1371564 , 6.19070734, 4.52162623,
9.65754005, 5.12851144, 6.15372125, 8.76683513, 2.6741406 ,
4.44616684, 0.66495949, 5.34158741, 7.65057928, 6.04764066,
8.02231826, 7.01857177, 8.00431468, 1.13790912, 4.99954725,
7.2621896 , 0.44326554, 5.76768052, 0.37966304, 6.42968731,
7.09978835, 6.18698578, 7.98937514, 2.57216271, 6.81886046,
6.90741371, 2.59964193, 2.55009511, 4.3022724 , 7.54589522,
7.61669279, 3.04586429, 5.42031453, 6.95982434, 6.3464283 ,
6.83108862, 0.23013868, 6.18117187, 2.07859734, 5.21140243,
7.65535555, 7.14082719, 3.5080198 , 9.34191919, 5.43017868,
3.83465243, 6.00970234, 6.6637854 , 1.94632702, 3.33995824,
1.40297141, 9.22981004, 5.9917318 , 3.30408296, 6.45064242,
7.63510315, 2.1028344 , 5.04364317, 1.15192906, 4.41068071,
9.84971516, 3.24319404, 1.33946368, 0.87762148, 6.83403898,
6.80042743, 0.01251295, 1.81226769, 3.45650826, 2.92048907,
9.0428148 , 6.70534587, 7.58172613, 8.46714664, 4.21300297,
5.59331316, 2.13504364, 7.60939477, 7.01361961, 6.27012717,
1.7583655 , 8.85144391, 7.88576881, 5.37161382, 0.85353242,
8.09557839, 7.89707566, 5.80200353, 1.79099921, 0.16579631,
8.85514883, 9.06356779, 0.77320984, 7.10475657, 5.84106039,
6.65358773, 5.92162495, 7.47555297, 5.86498437, 3.54212455,
9.09160288, 1.01746609, 0.31656529, 0.81507769, 3.74982773])
Variances (σ²):array([8.64609241, 6.42654556, 3.59339068, 2.35686243, 2.30067641,
4.38434829, 6.96066176, 4.68822755, 8.59584705, 6.36544687,
3.53253785, 9.93651252, 2.78413172, 7.10871391, 4.10768658,
0.52243792, 5.37355809, 8.07811144, 2.27624113, 1.18193093,
8.51126226, 7.58829783, 0.36964474, 8.62100098, 1.44742447,
7.5742743 , 2.24756859, 5.775812  , 3.02859954, 1.85013448,
1.09802889, 6.90634245, 1.22331029, 6.6358819 , 4.94818476,
9.67584527, 5.74083224, 6.38759358, 8.99218929, 3.03921157,
4.91853496, 0.72001645, 6.002398  , 8.02632471, 6.8075464 ,
9.11217518, 7.32022016, 8.71973894, 1.28900598, 5.35659581,
7.82459997, 0.48195684, 5.81128947, 0.3907813 , 6.6518295 ,
7.13664767, 6.44689065, 9.06824926, 2.74582527, 6.99785838,
7.56698234, 2.75096346, 2.63974917, 4.62080576, 7.81848017,
7.91407545, 3.13846193, 5.84152581, 7.46555859, 7.14424679,
7.70904828, 0.24549765, 6.35636527, 2.33672717, 5.63995278,
7.94499044, 8.07438199, 3.65295248, 9.40573387, 5.97307558,
3.8511461 , 6.72996038, 6.76730447, 2.06640315, 3.62152591,
1.44664128, 9.23984355, 6.79417678, 3.48516817, 7.23490664,
7.8809328 , 2.30898001, 5.38193639, 1.30044692, 4.52551797,
9.87927665, 3.32879865, 1.38497111, 0.92391188, 6.8392318 ,
7.07029865, 0.01365205, 1.95289133, 3.91452928, 3.2427911 ,
9.93913901, 7.61090619, 8.23287502, 8.71906442, 4.59411682,
6.04509413, 2.25801252, 7.96812608, 7.45095792, 6.7568413 ,
1.98632835, 9.93155416, 8.7618991 , 5.72290319, 0.89832103,
9.16452328, 8.54625358, 5.82895542, 1.81782662, 0.18591349,
9.74746269, 9.1777812 , 0.84323683, 7.85291449, 6.53405605,
7.19282835, 6.2696829 , 7.51522101, 6.39965357, 3.954388  ,
9.83405291, 1.10257819, 0.34112331, 0.88513177, 4.08457914])

If we omit the call to bins.sum in the original example, we can subsequently apply another histogramming or binning operation to the data:

[9]:

binned = binned.bin(y=100)
binned

[9]:

scipp.DataArray (220.91 KB)
• x: 39
• y: 100
• x
(x [bin-edge])
float64
m
0.0, 0.026, ..., 0.974, 1.0
Values:array([0.        , 0.02564103, 0.05128205, 0.07692308, 0.1025641 ,
0.12820513, 0.15384615, 0.17948718, 0.20512821, 0.23076923,
0.25641026, 0.28205128, 0.30769231, 0.33333333, 0.35897436,
0.38461538, 0.41025641, 0.43589744, 0.46153846, 0.48717949,
0.51282051, 0.53846154, 0.56410256, 0.58974359, 0.61538462,
0.64102564, 0.66666667, 0.69230769, 0.71794872, 0.74358974,
0.76923077, 0.79487179, 0.82051282, 0.84615385, 0.87179487,
0.8974359 , 0.92307692, 0.94871795, 0.97435897, 1.        ])
• y
(y [bin-edge])
float64
m
4.809e-05, 0.010, ..., 0.990, 1.000
Values:array([4.80862445e-05, 1.00466194e-02, 2.00451525e-02, 3.00436856e-02,
4.00422187e-02, 5.00407518e-02, 6.00392849e-02, 7.00378180e-02,
8.00363511e-02, 9.00348842e-02, 1.00033417e-01, 1.10031950e-01,
1.20030484e-01, 1.30029017e-01, 1.40027550e-01, 1.50026083e-01,
1.60024616e-01, 1.70023149e-01, 1.80021682e-01, 1.90020215e-01,
2.00018748e-01, 2.10017282e-01, 2.20015815e-01, 2.30014348e-01,
2.40012881e-01, 2.50011414e-01, 2.60009947e-01, 2.70008480e-01,
2.80007013e-01, 2.90005546e-01, 3.00004080e-01, 3.10002613e-01,
3.20001146e-01, 3.29999679e-01, 3.39998212e-01, 3.49996745e-01,
3.59995278e-01, 3.69993811e-01, 3.79992344e-01, 3.89990878e-01,
3.99989411e-01, 4.09987944e-01, 4.19986477e-01, 4.29985010e-01,
4.39983543e-01, 4.49982076e-01, 4.59980609e-01, 4.69979142e-01,
4.79977676e-01, 4.89976209e-01, 4.99974742e-01, 5.09973275e-01,
5.19971808e-01, 5.29970341e-01, 5.39968874e-01, 5.49967407e-01,
5.59965940e-01, 5.69964473e-01, 5.79963007e-01, 5.89961540e-01,
5.99960073e-01, 6.09958606e-01, 6.19957139e-01, 6.29955672e-01,
6.39954205e-01, 6.49952738e-01, 6.59951271e-01, 6.69949805e-01,
6.79948338e-01, 6.89946871e-01, 6.99945404e-01, 7.09943937e-01,
7.19942470e-01, 7.29941003e-01, 7.39939536e-01, 7.49938069e-01,
7.59936603e-01, 7.69935136e-01, 7.79933669e-01, 7.89932202e-01,
7.99930735e-01, 8.09929268e-01, 8.19927801e-01, 8.29926334e-01,
8.39924867e-01, 8.49923401e-01, 8.59921934e-01, 8.69920467e-01,
8.79919000e-01, 8.89917533e-01, 8.99916066e-01, 9.09914599e-01,
9.19913132e-01, 9.29911665e-01, 9.39910199e-01, 9.49908732e-01,
9.59907265e-01, 9.69905798e-01, 9.79904331e-01, 9.89902864e-01,
9.99901397e-01])
• (x, y)
DataArrayView
binned data [len=0, len=0, ..., len=3, len=2]
dim='position',
content=DataArray(
dims=(position: 5000),
data=float64[counts],
coords={'x':float64[m], 'y':float64[m]})

As in the 1-D example above, summing the bins is equivalent to histogramming binned data:

[10]:

binned.bins.sum().plot()

[10]: