Data Structures#

To keep this documentation generic we typically use dimensions x or y, but this should not be seen as a recommendation to use these labels for anything but actual positions or offsets in space.

Variable#

Basics#

scipp.Variable is a labeled multi-dimensional array. A variable has the following key properties:

  • values: a multi-dimensional array of values, e.g., similar to a numpy.ndarray

  • variances: an optional multi-dimensional array of variances for the array values

  • dims: a list of dimension labels (strings) for each axis of the array

  • unit: an optional physical unit of the values in the array

Note that variables, unlike DataArray and its eponym xarray.DataArray, do not have coordinate dicts.

[1]:
import numpy as np
import scipp as sc

Variables should generally be created using one of the available creation functions. For example, we can create a variable from a NumPy array:

[2]:
var = sc.array(dims=['x', 'y'], values=np.random.rand(2, 4), unit='s')

Using a unit is optional but highly encouraged if the variable represents a physical quantity. See Creating Arrays and Datasets for an overview of the different methods for creating variables.

Note:

Internally, Scipp does not use NumPy, so the above copies the values of the NumPy array into an internal buffer.

We can inspect the created variable as follows:

[3]:
sc.show(var)
[Figure: sc.show diagram of var, with dims=('x', 'y'), shape=(2, 4), unit=s, variances=False]
[4]:
var
[4]:
scipp.Variable (320 Bytes)
    • (x: 2, y: 4)
      float64
      s
      0.567, 0.632, ..., 0.501, 0.085
      Values:
      array([[0.56709952, 0.63202538, 0.40247056, 0.8668994 ], [0.32706771, 0.21725486, 0.50121661, 0.08478361]])

WARNING:

The above makes use of IPython’s rich output representation, but relying on this feature has a common pitfall:

IPython (and thus Jupyter) has an Output caching system. By default this keeps the last 1000 cell outputs. In the above case this is var (not the displayed HTML, but the object itself). If such cell outputs are large then this output cache can consume enormous amounts of memory.

Note that del var will not release the memory, since the IPython output cache still holds a reference to the same object. See this FAQ entry for clearing or disabling this caching.

[5]:
var.unit
[5]:
s
[6]:
var.values
[6]:
array([[0.56709952, 0.63202538, 0.40247056, 0.8668994 ],
       [0.32706771, 0.21725486, 0.50121661, 0.08478361]])

0-D variables (scalars)#

A 0-dimensional variable contains a single value (and an optional variance).

[7]:
scalar = sc.scalar(1.2, unit='s')
sc.show(scalar)
scalar
[Figure: sc.show diagram of scalar, with dims=(), shape=(), unit=s, variances=False]
[7]:
scipp.Variable (264 Bytes)
    • ()
      float64
      s
      1.2
      Values:
      array(1.2)

Singular versions of the values and variances properties are provided:

[8]:
print(scalar.value)
print(scalar.variance)
1.2
None

An exception is raised from the value and variance properties if the variable is not 0-dimensional.

Note:

Scalar variables are distinct from arrays that contain a single value. For example, sc.scalar(1) is equivalent to sc.array(dims=[], values=1). But all the following are distinct:

  • sc.array(dims=[], values=1)

  • sc.array(dims=['x'], values=[1])

  • sc.array(dims=['x', 'y'], values=[[1]])

In particular, the first is a scalar while the other two are not; they are arrays with an extent of one. Accessing the value property of one of the latter two variables would raise an exception because this property requires a 0-dimensional variable.

DataArray#

Basics#

scipp.DataArray is a labeled array with associated coordinates. A data array is essentially a Variable object with attached dicts of coordinates, masks, and attributes.

A data array has the following key properties:

  • data: the variable holding the array data (its values, variances, dims, and unit).

  • coords: a dict-like container of coordinates for the array.

  • masks: a dict-like container of masks for the array.

  • attrs: a dict-like container of “attributes” for the array.

All dict-likes are accessed using a string as key.

coords can be seen as independent variables of the data array, while data is the dependent variable. This means that coordinates must generally match in operations between data arrays. See the section on alignment in the computation guide for details.

masks allows for storing boolean-valued masks alongside data.

attrs holds additional metadata which is similar to coordinates but is dropped in operations if there is a mismatch.

data as well as the individual values of the coords, masks, and attrs dictionaries are of type Variable, i.e., they have a physical unit and can be used for computation. A data array can be created from variables for its constituents as follows:

[9]:
da = sc.DataArray(
    data=sc.array(dims=['y', 'x'], values=np.random.rand(2, 3)),
    coords={
        'y': sc.array(dims=['y'], values=np.arange(2.0), unit='m'),
        'x': sc.array(dims=['x'], values=np.arange(3.0), unit='m'),
    },
    attrs={'aux': sc.array(dims=['x'], values=np.random.rand(3))},
)
sc.show(da)
[Figure: sc.show diagram of da. Data: dims=('y', 'x'), shape=(2, 3), unit=dimensionless, variances=False. Coords: y (dims=('y',), shape=(2,), unit=m) and x (dims=('x',), shape=(3,), unit=m). Attrs: aux (dims=('x',), shape=(3,), unit=dimensionless).]

Note how the 'aux' attribute is essentially a secondary coordinate for the x dimension. The dict-like coords and masks properties give access to the respective underlying variables:

[10]:
da.coords['x']
[10]:
scipp.Variable (280 Bytes)
    • (x: 3)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
[11]:
da.attrs['aux']
[11]:
scipp.Variable (280 Bytes)
    • (x: 3)
      float64
      𝟙
      0.064, 0.078, 0.253
      Values:
      array([0.06439963, 0.07820432, 0.25329319])

Access to coords and attrs in a unified manner is possible with the meta property. Essentially this allows us to ignore whether a coordinate is aligned or not:

[12]:
da.meta['x']
[12]:
scipp.Variable (280 Bytes)
    • (x: 3)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
[13]:
da.meta['aux']
[13]:
scipp.Variable (280 Bytes)
    • (x: 3)
      float64
      𝟙
      0.064, 0.078, 0.253
      Values:
      array([0.06439963, 0.07820432, 0.25329319])

Unlike values when creating a variable, data as well as entries in the metadata dicts (coords, masks, and attrs) are not deep-copied on insertion into a data array. To avoid unwanted sharing, call the copy() method. Compare:

[14]:
x2 = sc.zeros(dims=['x'], shape=[3])
da.coords['x2_shared'] = x2
da.coords['x2_copied'] = x2.copy()
x2 += 123
da
[14]:
scipp.DataArray (2.16 KB)
    • y: 2
    • x: 3
    • x
      (x)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
    • x2_copied
      (x)
      float64
      𝟙
      0.0, 0.0, 0.0
      Values:
      array([0., 0., 0.])
    • x2_shared
      (x)
      float64
      𝟙
      123.0, 123.0, 123.0
      Values:
      array([123., 123., 123.])
    • y
      (y)
      float64
      m
      0.0, 1.0
      Values:
      array([0., 1.])
    • (y, x)
      float64
      𝟙
      0.989, 0.715, ..., 0.752, 0.994
      Values:
      array([[0.98873761, 0.71544419, 0.07860229], [0.98958325, 0.7519316 , 0.99423811]])
    • aux
      (x)
      float64
      𝟙
      0.064, 0.078, 0.253
      Values:
      array([0.06439963, 0.07820432, 0.25329319])

Metadata can be removed in the same way as in Python dicts:

[15]:
del da.attrs['aux']

Alignment of coordinates can be queried with the aligned property:

[16]:
da.coords['x'].aligned
[16]:
True

It can be set using the set_aligned method of the coordinates:

[17]:
da.coords.set_aligned('x', False)
da.coords['x'].aligned
[17]:
False

Note that the alignment is encoded in Variable. It is, however, only meaningful in a coords dict. Scipp ignores alignment in operations of plain variables and when handling masks.

The alignment flag is preserved when inserting variables into coords dicts:

[18]:
da2 = sc.DataArray(sc.arange('x', 3), coords={'x': da.coords['x']})
da2.coords['x'].aligned
[18]:
False

Dataset#

scipp.Dataset is a dict-like container of data arrays. Individual items of a dataset (“data arrays”) are accessed using a string as a dict key.

In a dataset the coordinates of the sub-arrays are enforced to be aligned. That is, a dataset is not simply a dict of data arrays. Instead, the individual arrays share their coordinates. It is therefore not possible to combine arbitrary data arrays into a dataset. If, e.g., the extents in a certain dimension mismatch, or if coordinate values mismatch, insertion of the mismatching data array will fail.

Often a dataset is not created from individual data arrays. Instead we may provide a dict of variables (the data of the items), and dicts for coords:

[19]:
ds = sc.Dataset(
    data={
        'a': sc.array(dims=['y', 'x'], values=np.random.rand(2, 3)),
        'b': sc.array(dims=['y'], values=np.random.rand(2)),
        'c': sc.scalar(value=1.0),
    },
    coords={
        'x': sc.array(dims=['x'], values=np.arange(3.0), unit='m'),
        'y': sc.array(dims=['y'], values=np.arange(2.0), unit='m'),
        'aux': sc.array(dims=['x'], values=np.random.rand(3)),
    },
)
sc.show(ds)
[Figure: sc.show diagram of ds. Items: a (dims=('y', 'x'), shape=(2, 3)), b (dims=('y',), shape=(2,)), c (dims=(), shape=()), all dimensionless and without variances. Coords: y (dims=('y',), shape=(2,), unit=m), x (dims=('x',), shape=(3,), unit=m), aux (dims=('x',), shape=(3,), unit=dimensionless).]
[20]:
ds
[20]:
scipp.Dataset (3.34 KB)
    • y: 2
    • x: 3
    • aux
      (x)
      float64
      𝟙
      0.776, 0.766, 0.044
      Values:
      array([0.77606671, 0.766337 , 0.04378998])
    • x
      (x)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
    • y
      (y)
      float64
      m
      0.0, 1.0
      Values:
      array([0., 1.])
    • a
      (y, x)
      float64
      𝟙
      0.709, 0.937, ..., 0.208, 0.895
      Values:
      array([[0.70900861, 0.93682812, 0.20656416], [0.95182341, 0.20813123, 0.89498453]])
    • b
      (y)
      float64
      𝟙
      0.112, 0.287
      Values:
      array([0.11170981, 0.2868136 ])
    • c
      ()
      float64
      𝟙
      1.0
      Values:
      array(1.)
[21]:
ds.coords['x'].values
[21]:
array([0., 1., 2.])

The name of a data item serves as a dict key. Item access returns a new data array which is a view onto the data in the dataset and its corresponding coordinates, i.e., no deep copy is made:

[22]:
sc.show(ds['a'])
ds['a']
[Figure: sc.show diagram of ds['a']. Data: dims=('y', 'x'), shape=(2, 3), unit=dimensionless, variances=False. Coords: y (dims=('y',), shape=(2,), unit=m), aux (dims=('x',), shape=(3,), unit=dimensionless), x (dims=('x',), shape=(3,), unit=m).]
[22]:
scipp.DataArray (1.61 KB)
    • y: 2
    • x: 3
    • aux
      (x)
      float64
      𝟙
      0.776, 0.766, 0.044
      Values:
      array([0.77606671, 0.766337 , 0.04378998])
    • x
      (x)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
    • y
      (y)
      float64
      m
      0.0, 1.0
      Values:
      array([0., 1.])
    • (y, x)
      float64
      𝟙
      0.709, 0.937, ..., 0.208, 0.895
      Values:
      array([[0.70900861, 0.93682812, 0.20656416], [0.95182341, 0.20813123, 0.89498453]])

Use the copy() method to turn the view into an independent object:

[23]:
copy_of_a = ds['a'].copy()
copy_of_a += 17  # does not change ds['a']
ds
[23]:
scipp.Dataset (3.34 KB)
    • y: 2
    • x: 3
    • aux
      (x)
      float64
      𝟙
      0.776, 0.766, 0.044
      Values:
      array([0.77606671, 0.766337 , 0.04378998])
    • x
      (x)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
    • y
      (y)
      float64
      m
      0.0, 1.0
      Values:
      array([0., 1.])
    • a
      (y, x)
      float64
      𝟙
      0.709, 0.937, ..., 0.208, 0.895
      Values:
      array([[0.70900861, 0.93682812, 0.20656416], [0.95182341, 0.20813123, 0.89498453]])
    • b
      (y)
      float64
      𝟙
      0.112, 0.287
      Values:
      array([0.11170981, 0.2868136 ])
    • c
      ()
      float64
      𝟙
      1.0
      Values:
      array(1.)

Each data item is linked to its corresponding coordinates, masks, and attributes. These are accessed using the coords, masks, and attrs properties. The variable holding the data of the dataset item is accessible via the data property:

[24]:
ds['a'].data
[24]:
scipp.Variable (304 Bytes)
    • (y: 2, x: 3)
      float64
      𝟙
      0.709, 0.937, ..., 0.208, 0.895
      Values:
      array([[0.70900861, 0.93682812, 0.20656416], [0.95182341, 0.20813123, 0.89498453]])

For convenience, properties of the data variable are also properties of the data item:

[25]:
ds['a'].values
[25]:
array([[0.70900861, 0.93682812, 0.20656416],
       [0.95182341, 0.20813123, 0.89498453]])
[26]:
ds['a'].variances
[27]:
ds['a'].unit
[27]:
dimensionless

Coordinates of a data item include only those that are relevant to the item’s dimensions; all others are hidden. For example, when accessing 'b', which does not depend on the 'x' dimension, the coord for 'x' as well as the 'aux' coord are not part of the item’s coords:

[28]:
sc.show(ds['b'])
[Figure: sc.show diagram of ds['b']. Data: dims=('y',), shape=(2,), unit=dimensionless, variances=False. Coords: y (dims=('y',), shape=(2,), unit=m).]

Similarly, when accessing a 0-dimensional data item, it will have no coordinates:

[29]:
sc.show(ds['c'])
[Figure: sc.show diagram of ds['c']. Data: dims=(), shape=(), unit=dimensionless, variances=False. No coords.]

All variables in a dataset must have consistent dimensions. Thanks to labeled dimensions, transposed data is supported:

[34]:
ds['d'] = sc.array(dims=['x', 'y'], values=np.random.rand(3, 2))
sc.show(ds)
ds
[Figure: sc.show diagram of ds after inserting d. Items: d (dims=('x', 'y'), shape=(3, 2)), a (dims=('y', 'x'), shape=(2, 3)), b (dims=('y',), shape=(2,)), c (dims=(), shape=()), all dimensionless and without variances. Coords: y (dims=('y',), shape=(2,), unit=m), aux (dims=('x',), shape=(3,), unit=dimensionless), x (dims=('x',), shape=(3,), unit=m).]
[34]:
scipp.Dataset (4.14 KB)
    • y: 2
    • x: 3
    • aux
      (x)
      float64
      𝟙
      0.776, 0.766, 0.044
      Values:
      array([0.77606671, 0.766337 , 0.04378998])
    • x
      (x)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
    • y
      (y)
      float64
      m
      0.0, 1.0
      Values:
      array([0., 1.])
    • a
      (y, x)
      float64
      𝟙
      0.709, 0.937, ..., 0.208, 0.895
      Values:
      array([[0.70900861, 0.93682812, 0.20656416], [0.95182341, 0.20813123, 0.89498453]])
    • b
      (y)
      float64
      𝟙
      0.112, 0.287
      Values:
      array([0.11170981, 0.2868136 ])
    • c
      ()
      float64
      𝟙
      1.0
      Values:
      array(1.)
    • d
      (x, y)
      float64
      𝟙
      0.805, 0.519, ..., 0.485, 0.188
      Values:
      array([[0.8053422 , 0.5191902 ], [0.83437707, 0.29979236], [0.48460997, 0.18829954]])

When inserting a data array or variable into a dataset, ownership is shared by default. Use the copy() method if this sharing is undesirable:

[35]:
ds['da_shared'] = da
ds['da_copied'] = da.copy()
da += 1000
ds
[35]:
scipp.Dataset (6.31 KB)
    • y: 2
    • x: 3
    • aux
      (x)
      float64
      𝟙
      0.776, 0.766, 0.044
      Values:
      array([0.77606671, 0.766337 , 0.04378998])
    • x
      (x)
      float64
      m
      0.0, 1.0, 2.0
      Values:
      array([0., 1., 2.])
    • x2_copied
      (x)
      float64
      𝟙
      0.0, 0.0, 0.0
      Values:
      array([0., 0., 0.])
    • x2_shared
      (x)
      float64
      𝟙
      123.0, 123.0, 123.0
      Values:
      array([123., 123., 123.])
    • y
      (y)
      float64
      m
      0.0, 1.0
      Values:
      array([0., 1.])
    • a
      (y, x)
      float64
      𝟙
      0.709, 0.937, ..., 0.208, 0.895
      Values:
      array([[0.70900861, 0.93682812, 0.20656416], [0.95182341, 0.20813123, 0.89498453]])
    • b
      (y)
      float64
      𝟙
      0.112, 0.287
      Values:
      array([0.11170981, 0.2868136 ])
    • c
      ()
      float64
      𝟙
      1.0
      Values:
      array(1.)
    • d
      (x, y)
      float64
      𝟙
      0.805, 0.519, ..., 0.485, 0.188
      Values:
      array([[0.8053422 , 0.5191902 ], [0.83437707, 0.29979236], [0.48460997, 0.18829954]])
    • da_copied
      (y, x)
      float64
      𝟙
      0.989, 0.715, ..., 0.752, 0.994
      Values:
      array([[0.98873761, 0.71544419, 0.07860229], [0.98958325, 0.7519316 , 0.99423811]])
    • da_shared
      (y, x)
      float64
      𝟙
      1000.989, 1000.715, ..., 1000.752, 1000.994
      Values:
      array([[1000.98873761, 1000.71544419, 1000.07860229], [1000.98958325, 1000.7519316 , 1000.99423811]])

The usual dict-like methods are available for Dataset:

[36]:
for name in ds:
    print(name)
a
b
c
d
da_shared
da_copied
[37]:
'a' in ds
[37]:
True
[38]:
'e' in ds
[38]:
False

DataGroup#

scipp.DataGroup is a dict-like container for arbitrary Scipp or Python objects. Unlike Dataset, DataGroup does not have coords and does not enforce compatible dimensions of its items. A DataGroup can contain other DataGroup objects and thus allows for representing tree-like data. It can be created like a Python dict:

[39]:
import numpy as np

import scipp as sc

dg = sc.DataGroup(
    a=sc.arange('x', 4),
    b=sc.arange('x', 6),
    c=sc.arange('y', 2),
    d=np.ones((2, 3)),
    e='a string',
)
dg
[39]:
  • a
    scipp
    Variable
    (x: 4)
    int64
    𝟙
    0, 1, 2, 3
  • b
    scipp
    Variable
    (x: 6)
    int64
    𝟙
    0, 1, ..., 4, 5
  • c
    scipp
    Variable
    (y: 2)
    int64
    𝟙
    0, 1
  • d
    numpy
    ndarray
    ()
    shape=(2, 3), dtype=float64, values=1.0, ... , 1.0
  • e
    str
    ()
    a string

Just like DataArray, DataGroup provides properties such as dims, shape, and sizes:

[40]:
dg.dims
[40]:
('x', 'y')
[41]:
dg.shape
[41]:
(None, 2)
[42]:
dg.sizes
[42]:
{'x': None, 'y': 2}

The properties return the union of these properties over all the items in the data group. Non-Scipp objects are considered to have dims=() and shape=(). When items have inconsistent size along a dimension then shape and sizes report this as None.

DataGroup supports positional indexing if the shape along the indexed dimension is unique. Label-based indexing is supported if all items have a corresponding coordinate, even if the shape is not unique.

Most Scipp operations also work for DataGroup, provided that the operation works for all items in the group. That is, operations will generally fail if the data group contains non-Scipp objects such as NumPy arrays or other Python objects such as integers or strings.