Data Structures#

To keep this documentation generic we typically use dimensions x or y, but this should not be seen as a recommendation to use these labels for anything but actual positions or offsets in space.

Variable#

Basics#

scipp.Variable is a labeled multi-dimensional array. A variable has the following key properties:

  • values: a multi-dimensional array of values, e.g., a numpy.ndarray

  • variances: a (optional) multi-dimensional array of variances for the array values

  • dims: a list of dimension labels (strings) for each axis of the array

  • unit: a (optional) physical unit of the values in the array

Note that variables, unlike DataArray and its eponym xarray.DataArray, do not have coordinate dicts.

[1]:
import numpy as np
import scipp as sc

Variables should generally be created using one of the available creation functions. For example, we can create a variable from a numpy array:

[2]:
var = sc.array(dims=['x', 'y'], values=np.random.rand(2, 4))

Note:

Internally scipp is not using numpy, so the above makes a copy of the numpy array of values into an internal buffer.

We can inspect the created variable as follows:

[3]:
sc.show(var)
dims=['x', 'y'], shape=[2, 4], unit=dimensionless, variances=False
[4]:
var
[4]:
scipp.Variable (64 Bytes)
    • (x: 2, y: 4)
      float64
      𝟙
      0.894, 0.142, ..., 0.204, 0.242
      Values:
      array([[0.89423414, 0.14236238, 0.36171925, 0.82784281], [0.07544893, 0.11382585, 0.2040739 , 0.24180717]])
[5]:
var.unit
[5]:
dimensionless
[6]:
var.values
[6]:
array([[0.89423414, 0.14236238, 0.36171925, 0.82784281],
       [0.07544893, 0.11382585, 0.2040739 , 0.24180717]])
[7]:
print(var.variances)
None

Variances must have the same shape as values, and units are specified using the scipp.units module or with a string:

[8]:
var = sc.array(dims=['x', 'y'],
               unit='m/s',
               values=np.random.rand(2, 4),
               variances=np.random.rand(2, 4))
sc.show(var)
dims=['x', 'y'], shape=[2, 4], unit=m/s, variances=True
[9]:
var
[9]:
scipp.Variable (128 Bytes)
    • (x: 2, y: 4)
      float64
      m/s
      0.215, 0.237, ..., 0.834, 0.125
      σ = 0.748, 0.881, ..., 0.847, 0.253
      Values:
      array([[0.21466711, 0.23704298, 0.8594167 , 0.02418335], [0.30101171, 0.70461868, 0.83374554, 0.12526722]])

      Variances (σ²):
      array([[0.55991155, 0.77539278, 0.11906929, 0.02527923], [0.01919702, 0.54143084, 0.71718813, 0.0640392 ]])
[10]:
var.variances
[10]:
array([[0.55991155, 0.77539278, 0.11906929, 0.02527923],
       [0.01919702, 0.54143084, 0.71718813, 0.0640392 ]])

0-D variables (scalars)#

A 0-dimensional variable contains a single value (and an optional variance). The most convenient way to create a scalar variable is by multiplying a value by a unit:

[11]:
scalar = 1.2 * sc.units.m
sc.show(scalar)
scalar
dims=[], shape=[], unit=m, variances=False
[11]:
scipp.Variable (8 Bytes)
    • ()
      float64
      m
      1.2
      Values:
      array(1.2)

Singular versions of the values and variances properties are provided:

[12]:
print(scalar.value)
print(scalar.variance)
1.2
None

An exception is raised from the value and variance properties if the variable is not 0-dimensional. Note that a variable with one or more dimension extent(s) of 1 contains just a single value as well, but the value property will nevertheless raise an exception.

Creating scalar variables with variances or with a custom dtype is possible using scipp.scalar:

[13]:
var_0d = sc.scalar(value=1.0, variance=0.5, dtype=sc.DType.float32, unit='kg')
var_0d
[13]:
scipp.Variable (8 Bytes)
    • ()
      float32
      kg
      1.0
      σ = 0.70710677
      Values:
      array(1., dtype=float32)

      Variances (σ²):
      array(0.5, dtype=float32)
[14]:
var_0d.value = 2.3
var_0d.variance
[14]:
0.5

DataArray#

Basics#

scipp.DataArray is a labeled array with associated coordinates. A data array is essentially a Variable object with attached dicts of coordinates, masks, and attributes.

A data array has the following key properties:

  • data: the variable holding the array data.

  • coords: a dict-like container of coordinates for the array, accessed using a string as dict key.

  • masks: a dict-like container of masks for the array, accessed using a string as dict key.

  • attrs: a dict-like container of “attributes” for the array, accessed using a string as dict key.

The key distinction between coordinates (added via the coords property) and attributes (added via the attrs property) is that the former are required to match (“align”) in operations between data arrays whereas the latter are not.

masks allows for storing boolean-valued masks alongside data. The data as well as all items of coords, masks, and attrs are internally a Variable, i.e., they have a physical unit and optionally variances.

[15]:
a = sc.DataArray(
    data = sc.array(dims=['y', 'x'], values=np.random.rand(2, 3)),
    coords={
        'y': sc.array(dims=['y'], values=np.arange(2.0), unit='m'),
        'x': sc.array(dims=['x'], values=np.arange(3.0), unit='m')},
    attrs={
        'aux': sc.array(dims=['x'], values=np.random.rand(3))})
sc.show(a)
data: dims=['y', 'x'], shape=[2, 3], unit=dimensionless, variances=False
y: dims=['y'], shape=[2], unit=m, variances=False
x: dims=['x'], shape=[3], unit=m, variances=False
aux: dims=['x'], shape=[3], unit=dimensionless, variances=False

Note how the 'aux' attribute is essentially a secondary coordinate for the x dimension. The dict-like coords, masks, and attrs properties give access to the respective underlying variables:

[16]:
a.coords['x']
[16]:
scipp.Variable (24 Bytes)
    • (x: 3)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
[17]:
a.attrs['aux']
[17]:
scipp.Variable (24 Bytes)
    • (x: 3)
      float64
      𝟙
      0.827, 0.098, 0.358
      Values:
      array([0.82727589, 0.09825948, 0.35824026])

Access to coords and attrs in a unified manner is possible with the meta property. Essentially this allows us to ignore whether a coordinate is aligned or not:

[18]:
a.meta['x']
[18]:
scipp.Variable (24 Bytes)
    • (x: 3)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
[19]:
a.meta['aux']
[19]:
scipp.Variable (24 Bytes)
    • (x: 3)
      float64
      𝟙
      0.827, 0.098, 0.358
      Values:
      array([0.82727589, 0.09825948, 0.35824026])

Unlike values when creating a variable, data as well as entries in the meta data dicts (coords, masks, and attrs) are not deep-copied on insertion into a data array. To avoid unwanted sharing, call the copy() method. Compare:

[20]:
x2 = sc.zeros(dims=['x'], shape=[3])
a.coords['x2_shared'] = x2
a.coords['x2_copied'] = x2.copy()
x2 += 123
a
[20]:
scipp.DataArray (160 Bytes)
    • y: 2
    • x: 3
    • x
      (x)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
    • x2_copied
      (x)
      float64
      𝟙
      0.0, 0.0, 0.0
      Values:
      array([0., 0., 0.])
    • x2_shared
      (x)
      float64
      𝟙
      123.0, 123.0, 123.0
      Values:
      array([123., 123., 123.])
    • y
      (y)
      float64
      m
      0.0, 1.0
      Values:
      array([0., 1.])
    • (y, x)
      float64
      𝟙
      0.424, 0.404, ..., 0.838, 0.374
      Values:
      array([[0.42445196, 0.40372322, 0.08642007], [0.6260914 , 0.83843135, 0.37416552]])
    • aux
      (x)
      float64
      𝟙
      0.827, 0.098, 0.358
      Values:
      array([0.82727589, 0.09825948, 0.35824026])

Meta data can be removed in the same way as in Python dicts:

[21]:
del a.attrs['aux']

Distinction between dimension coords and non-dimension coords, and coords and attrs#

When the name of a coord matches its dimension, e.g., if a.coords['x'] depends on dimension 'x' as in the above example, we call this coord a dimension coordinate. Otherwise it is called a non-dimension coord. It is important to highlight that for practical purposes (such as matching in operations) dimension coords and non-dimension coords are handled equivalently. Essentially:

  • Non-dimension coordinates are coordinates.

  • There is at most one dimension coord for each dimension, but there can be multiple non-dimension coords.

  • Operations such as value-based slicing that accept an input dimension and require lookup of coordinate values will only consider dimension coordinates.

As mentioned above, the difference between coords and attrs is “alignment”, i.e., only the former are compared in operations. The concept of dimension coords is unrelated to the distinction between coords and attrs. In particular, dimension coords could be made attrs if desired, and non-dimension coords can be (and often are) “aligned” coords.

Dataset#

scipp.Dataset is a dict-like container of data arrays. Individual items of a dataset (“data arrays”) are accessed using a string as a dict key.

In a dataset the coordinates of the sub-arrays are enforced to be aligned. That is, a dataset is not actually just a dict of data arrays. Instead, the individual arrays share their coordinates. It is therefore not possible to combine arbitrary data arrays into a dataset. If, e.g., the extents in a certain dimension mismatch, or if coordinate values mismatch, insertion of the mismatching data array will fail.

Often a dataset is not created from individual data arrays. Instead we may provide a dict of variables (the data of the items), and dicts for coords:

[22]:
d = sc.Dataset(
            data={
                'a': sc.array(dims=['y', 'x'], values=np.random.rand(2, 3)),
                'b': sc.array(dims=['y'], values=np.random.rand(2)),
                'c': sc.scalar(value=1.0)},
             coords={
                 'x': sc.array(dims=['x'], values=np.arange(3.0), unit='m'),
                 'y': sc.array(dims=['y'], values=np.arange(2.0), unit='m'),
                 'aux': sc.array(dims=['x'], values=np.random.rand(3))})
sc.show(d)
a: dims=['y', 'x'], shape=[2, 3], unit=dimensionless, variances=False
y: dims=['y'], shape=[2], unit=m, variances=False
b: dims=['y'], shape=[2], unit=dimensionless, variances=False
c: dims=[], shape=[], unit=dimensionless, variances=False
x: dims=['x'], shape=[3], unit=m, variances=False
aux: dims=['x'], shape=[3], unit=dimensionless, variances=False
[23]:
d
[23]:
scipp.Dataset (136 Bytes)
    • y: 2
    • x: 3
    • aux
      (x)
      float64
      𝟙
      0.145, 0.869, 0.674
      Values:
      array([0.14544378, 0.86853973, 0.67446091])
    • x
      (x)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
    • y
      (y)
      float64
      m
      0.0, 1.0
      Values:
      array([0., 1.])
    • a
      (y, x)
      float64
      𝟙
      0.744, 0.147, ..., 0.939, 0.076
      Values:
      array([[0.74404493, 0.14671771, 0.63655525], [0.5232824 , 0.93902102, 0.07627952]])
    • b
      (y)
      float64
      𝟙
      0.552, 0.077
      Values:
      array([0.55200737, 0.07663214])
    • c
      ()
      float64
      𝟙
      1.0
      Values:
      array(1.)
[24]:
d.coords['x'].values
[24]:
array([0., 1., 2.])

The name of a data item serves as a dict key. Item access returns a new data array which is a view onto the data in the dataset and its corresponding coordinates, i.e., no deep copy is made:

[25]:
sc.show(d['a'])
d['a']
data: dims=['y', 'x'], shape=[2, 3], unit=dimensionless, variances=False
y: dims=['y'], shape=[2], unit=m, variances=False
x: dims=['x'], shape=[3], unit=m, variances=False
aux: dims=['x'], shape=[3], unit=dimensionless, variances=False
[25]:
scipp.DataArray (112 Bytes)
    • y: 2
    • x: 3
    • aux
      (x)
      float64
      𝟙
      0.145, 0.869, 0.674
      Values:
      array([0.14544378, 0.86853973, 0.67446091])
    • x
      (x)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
    • y
      (y)
      float64
      m
      0.0, 1.0
      Values:
      array([0., 1.])
    • (y, x)
      float64
      𝟙
      0.744, 0.147, ..., 0.939, 0.076
      Values:
      array([[0.74404493, 0.14671771, 0.63655525], [0.5232824 , 0.93902102, 0.07627952]])

Use the copy() method to turn the view into an independent object:

[26]:
copy_of_a = d['a'].copy()
copy_of_a += 17  # does not change d['a']
d
[26]:
scipp.Dataset (136 Bytes)
    • y: 2
    • x: 3
    • aux
      (x)
      float64
      𝟙
      0.145, 0.869, 0.674
      Values:
      array([0.14544378, 0.86853973, 0.67446091])
    • x
      (x)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
    • y
      (y)
      float64
      m
      0.0, 1.0
      Values:
      array([0., 1.])
    • a
      (y, x)
      float64
      𝟙
      0.744, 0.147, ..., 0.939, 0.076
      Values:
      array([[0.74404493, 0.14671771, 0.63655525], [0.5232824 , 0.93902102, 0.07627952]])
    • b
      (y)
      float64
      𝟙
      0.552, 0.077
      Values:
      array([0.55200737, 0.07663214])
    • c
      ()
      float64
      𝟙
      1.0
      Values:
      array(1.)

Each data item is linked to its corresponding coordinates, masks, and attributes. These are accessed using the coords, masks, and attrs properties. The variable holding the data of the dataset item is accessible via the data property:

[27]:
d['a'].data
[27]:
scipp.Variable (48 Bytes)
    • (y: 2, x: 3)
      float64
      𝟙
      0.744, 0.147, ..., 0.939, 0.076
      Values:
      array([[0.74404493, 0.14671771, 0.63655525], [0.5232824 , 0.93902102, 0.07627952]])

For convenience, properties of the data variable are also properties of the data item:

[28]:
d['a'].values
[28]:
array([[0.74404493, 0.14671771, 0.63655525],
       [0.5232824 , 0.93902102, 0.07627952]])
[29]:
d['a'].variances
[30]:
d['a'].unit
[30]:
dimensionless

Coordinates of a data item include only those that are relevant to the item’s dimensions; all others are hidden. For example, when accessing 'b', which does not depend on the 'x' dimension, the coord for 'x' as well as the 'aux' coord are not part of the item’s coords:

[31]:
sc.show(d['b'])
data: dims=['y'], shape=[2], unit=dimensionless, variances=False
y: dims=['y'], shape=[2], unit=m, variances=False

Similarly, when accessing a 0-dimensional data item, it will have no coordinates:

[32]:
sc.show(d['c'])
data: dims=[], shape=[], unit=dimensionless, variances=False

All variables in a dataset must have consistent dimensions. Thanks to labeled dimensions, transposed data is supported:

[33]:
d['d'] = sc.array(dims=['x', 'y'], values=np.random.rand(3, 2))
sc.show(d)
d
d: dims=['x', 'y'], shape=[3, 2], unit=dimensionless, variances=False
a: dims=['y', 'x'], shape=[2, 3], unit=dimensionless, variances=False
y: dims=['y'], shape=[2], unit=m, variances=False
b: dims=['y'], shape=[2], unit=dimensionless, variances=False
c: dims=[], shape=[], unit=dimensionless, variances=False
x: dims=['x'], shape=[3], unit=m, variances=False
aux: dims=['x'], shape=[3], unit=dimensionless, variances=False
[33]:
scipp.Dataset (184 Bytes)
    • y: 2
    • x: 3
    • aux
      (x)
      float64
      𝟙
      0.145, 0.869, 0.674
      Values:
      array([0.14544378, 0.86853973, 0.67446091])
    • x
      (x)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
    • y
      (y)
      float64
      m
      0.0, 1.0
      Values:
      array([0., 1.])
    • a
      (y, x)
      float64
      𝟙
      0.744, 0.147, ..., 0.939, 0.076
      Values:
      array([[0.74404493, 0.14671771, 0.63655525], [0.5232824 , 0.93902102, 0.07627952]])
    • b
      (y)
      float64
      𝟙
      0.552, 0.077
      Values:
      array([0.55200737, 0.07663214])
    • c
      ()
      float64
      𝟙
      1.0
      Values:
      array(1.)
    • d
      (x, y)
      float64
      𝟙
      0.771, 0.810, ..., 0.074, 0.511
      Values:
      array([[0.77056014, 0.81032742], [0.71253242, 0.09154087], [0.07447874, 0.5114729 ]])

When inserting a data array or variable into a dataset, ownership is shared by default. Use the copy() method to avoid this if sharing is undesirable:

[34]:
d['a_shared'] = a
d['a_copied'] = a.copy()
a += 1000
d
[34]:
scipp.Dataset (328 Bytes)
    • y: 2
    • x: 3
    • aux
      (x)
      float64
      𝟙
      0.145, 0.869, 0.674
      Values:
      array([0.14544378, 0.86853973, 0.67446091])
    • x
      (x)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
    • x2_copied
      (x)
      float64
      𝟙
      0.0, 0.0, 0.0
      Values:
      array([0., 0., 0.])
    • x2_shared
      (x)
      float64
      𝟙
      123.0, 123.0, 123.0
      Values:
      array([123., 123., 123.])
    • y
      (y)
      float64
      m
      0.0, 1.0
      Values:
      array([0., 1.])
    • a
      (y, x)
      float64
      𝟙
      0.744, 0.147, ..., 0.939, 0.076
      Values:
      array([[0.74404493, 0.14671771, 0.63655525], [0.5232824 , 0.93902102, 0.07627952]])
    • a_copied
      (y, x)
      float64
      𝟙
      0.424, 0.404, ..., 0.838, 0.374
      Values:
      array([[0.42445196, 0.40372322, 0.08642007], [0.6260914 , 0.83843135, 0.37416552]])
    • a_shared
      (y, x)
      float64
      𝟙
      1000.424, 1000.404, ..., 1000.838, 1000.374
      Values:
      array([[1000.42445196, 1000.40372322, 1000.08642007], [1000.6260914 , 1000.83843135, 1000.37416552]])
    • b
      (y)
      float64
      𝟙
      0.552, 0.077
      Values:
      array([0.55200737, 0.07663214])
    • c
      ()
      float64
      𝟙
      1.0
      Values:
      array(1.)
    • d
      (x, y)
      float64
      𝟙
      0.771, 0.810, ..., 0.074, 0.511
      Values:
      array([[0.77056014, 0.81032742], [0.71253242, 0.09154087], [0.07447874, 0.5114729 ]])

The usual dict-like methods are available for Dataset:

[35]:
for name in d:
    print(name)
a_copied
a_shared
c
b
d
a
[36]:
'a' in d
[36]:
True
[37]:
'e' in d
[37]:
False