# Data Structures

To keep this documentation generic we typically use dimensions `x` or `y`, but this should *not* be seen as a recommendation to use these labels for anything but actual positions or offsets in space.

## Variable

### Basics

[scipp.Variable](../../generated/classes/scipp.Variable.rst#scipp.Variable) is a labeled multi-dimensional array.
A variable has the following key properties:

- `values`: a multi-dimensional array of values, e.g., similar to a [numpy.ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)
- `variances`: an (optional) multi-dimensional array of variances for the array values
- `dims`: a list of dimension labels (strings) for each axis of the array
- `unit`: an (optional) physical unit of the values in the array

Note that variables, unlike [DataArray](data-structures.ipynb#DataArray) and its namesake [xarray.DataArray](http://xarray.pydata.org/en/stable/user-guide/data-structures.html#dataarray), do *not* have coordinate dicts.

In [None]:
import numpy as np
import scipp as sc

Variables should generally be created using one of the available [creation functions](../../reference/creation-functions.rst#creation-functions).
For example, we can create a variable from a NumPy array:

In [None]:
var = sc.array(dims=['x', 'y'], values=np.random.rand(2, 4), unit='s')

Using a unit is optional but highly encouraged if the variable represents a physical quantity.
See [Creating Arrays and Datasets](./creating-arrays-and-datasets.rst) for an overview of the different methods for creating variables.

<div class="alert alert-info">

**Note:**

Internally Scipp does not use NumPy, so the above makes a *copy* of the numpy array of values into an internal buffer.
    
</div>

We can inspect the created variable as follows:

In [None]:
sc.show(var)

In [None]:
var

<div class="alert alert-warning">
    <b>WARNING:</b>

The above makes use of IPython's rich output representation, but relying on this feature has a common pitfall:
    
IPython (and thus Jupyter) has an [Output caching system](https://ipython.readthedocs.io/en/stable/interactive/reference.html?highlight=previous#output-caching-system).
By default this keeps the last 1000 cell outputs.
In the above case this is `var` (not the displayed HTML, but the object itself).
If such cell outputs are large then this output cache can consume enormous amounts of memory.

Note that `del var` will *not* release the memory, since the IPython output cache still holds a reference to the same object.
See [this FAQ entry](../../getting-started/faq.rst#scipp-is-using-more-and-more-memory-the-jupyter-kernel-crashes) for clearing or disabling this caching.

</div>

In [None]:
var.unit

In [None]:
var.values

### 0-D variables (scalars)

A 0-dimensional variable contains a single value (and an optional variance).

In [None]:
scalar = sc.scalar(1.2, unit='s')
sc.show(scalar)
scalar

Singular versions of the `values` and `variances` properties are provided:

In [None]:
print(scalar.value)
print(scalar.variance)

An exception is raised from the `value` and `variance` properties if the variable is not 0-dimensional.

<div class="alert alert-info">

**Note:**

Scalar variables are distinct from arrays that contain a single value.
For example, `sc.scalar(1)` is equivalent to `sc.array(dims=[], values=1)`.
But all the following are distinct:

 - `sc.array(dims=[], values=1)`
 - `sc.array(dims=['x'], values=[1])`
 - `sc.array(dims=['x', 'y'], values=[[1]])`

In particular, the first is a scalar while the other two are not; they are arrays with an extent of one.
Accessing the `value` property of one of the latter two variables would raise an exception because this property requires a 0-dimensional variable.

</div>

## DataArray

### Basics

[scipp.DataArray](../../generated/classes/scipp.DataArray.rst#scipp.DataArray) is a labeled array with associated coordinates.
A data array is essentially a [Variable](../../generated/classes/scipp.Variable.rst#scipp.Variable) object with attached dicts of coordinates, masks, and attributes.

A data array has the following key properties:

- `data`: the variable holding the array data (its values, variances, dims, and unit).
- `coords`: a dict-like container of coordinates for the array, accessed using a string as dict key.
- `masks`: a dict-like container of masks for the array, accessed using a string as dict key.
- `attrs`: a dict-like container of "attributes" for the array, accessed using a string as dict key.

The key distinction between coordinates (added via the `coords` property) and attributes (added via the `attrs` property) is that the former are required to match ("align") in operations between data arrays whereas the latter are not.

`masks` allows for storing boolean-valued masks alongside data.

`data` as well as the individual values of the `coords`, `masks`, and `attrs` dictionaries are of type [Variable](../../generated/classes/scipp.Variable.rst#scipp.Variable), i.e., they have a physical unit and can be used for computation.
A data array can be created from variables for its constituents as follows:

In [None]:
da = sc.DataArray(
    data=sc.array(dims=['y', 'x'], values=np.random.rand(2, 3)),
    coords={
        'y': sc.array(dims=['y'], values=np.arange(2.0), unit='m'),
        'x': sc.array(dims=['x'], values=np.arange(3.0), unit='m'),
    },
    attrs={'aux': sc.array(dims=['x'], values=np.random.rand(3))},
)
sc.show(da)

Note how the `'aux'` attribute is essentially a secondary coordinate for the x dimension.
The dict-like `coords`, `masks`, and `attrs` properties give access to the respective underlying variables:

In [None]:
da.coords['x']

In [None]:
da.attrs['aux']

Access to coords and attrs in a unified manner is possible with the `meta` property.
Essentially this allows us to ignore whether a coordinate is aligned or not:

In [None]:
da.meta['x']

In [None]:
da.meta['aux']

Unlike `values` when creating a variable, `data` as well as entries in the metadata dicts (`coords`, `masks`, and `attrs`) are *not* deep-copied on insertion into a data array.
To avoid unwanted sharing, call the `copy()` method.
Compare:

In [None]:
x2 = sc.zeros(dims=['x'], shape=[3])
da.coords['x2_shared'] = x2
da.coords['x2_copied'] = x2.copy()
x2 += 123
da

Metadata can be removed in the same way as in Python dicts:

In [None]:
del da.attrs['aux']

### Distinction between dimension coords and non-dimension coords, and coords and attrs

When the name of a coord matches its dimension, e.g., if `da.coords['x']` depends on dimension `'x'` as in the above example, we call this coord a *dimension coordinate*.
Otherwise it is called a *non-dimension coord*.
It is important to highlight that for practical purposes (such as matching in operations) **dimension coords and non-dimension coords are handled equivalently**.
Essentially:

- There is at most one dimension coord for each dimension, but there can be multiple non-dimension coords.
- The dimension coordinate is the "primary" coordinate and will be used, e.g., when creating a plot (unless specified otherwise).

As mentioned above, the difference between coords and attrs is "alignment", i.e., only the former are compared in operations.
The concept of dimension coords is unrelated to the distinction between `coords` or `attrs`.
In particular, dimension coords could be made attrs if desired, and non-dimension coords can be (and often are) "aligned" coords.

## Dataset

[scipp.Dataset](../../generated/classes/scipp.Dataset.rst#scipp.Dataset) is a dict-like container of data arrays.
Individual items of a dataset ("data arrays") are accessed using a string as a dict key.

In a dataset the coordinates of the sub-arrays are enforced to be *aligned*.
That is, a dataset is not simply a dict of data arrays.
Instead, the individual arrays share their coordinates.
It is therefore not possible to combine *arbitrary* data arrays into a dataset.
If, e.g., the extents in a certain dimension mismatch, or if coordinate values mismatch, insertion of the mismatching data array will fail.

Often a dataset is not created from individual data arrays.
Instead we may provide a dict of variables (the data of the items), and dicts for coords:

In [None]:
ds = sc.Dataset(
    data={
        'a': sc.array(dims=['y', 'x'], values=np.random.rand(2, 3)),
        'b': sc.array(dims=['y'], values=np.random.rand(2)),
        'c': sc.scalar(value=1.0),
    },
    coords={
        'x': sc.array(dims=['x'], values=np.arange(3.0), unit='m'),
        'y': sc.array(dims=['y'], values=np.arange(2.0), unit='m'),
        'aux': sc.array(dims=['x'], values=np.random.rand(3)),
    },
)
sc.show(ds)

In [None]:
ds

In [None]:
ds.coords['x'].values

The name of a data item serves as a dict key.
Item access returns a new data array which is a view onto the data in the dataset and its corresponding coordinates, i.e., no deep copy is made:

In [None]:
sc.show(ds['a'])
ds['a']

Use the `copy()` method to turn the view into an independent object:

In [None]:
copy_of_a = ds['a'].copy()
copy_of_a += 17  # does not change ds['a']
ds

Each data item is linked to its corresponding coordinates, masks, and attributes.
These are accessed using the `coords`, `masks`, and `attrs` properties.
The variable holding the data of the dataset item is accessible via the `data` property:

In [None]:
ds['a'].data

For convenience, properties of the data variable are also properties of the data item:

In [None]:
ds['a'].values

In [None]:
ds['a'].variances

In [None]:
ds['a'].unit

Coordinates of a data item include only those that are relevant to the item's dimensions; all others are hidden.
For example, when accessing `'b'`, which does not depend on the `'x'` dimension, the coord for `'x'` as well as the `'aux'` coord are not part of the item's `coords`:

In [None]:
sc.show(ds['b'])

Similarly, when accessing a 0-dimensional data item, it will have no coordinates:

In [None]:
sc.show(ds['c'])

All variables in a dataset must have consistent dimensions.
Thanks to labeled dimensions, transposed data is supported:

In [None]:
ds['d'] = sc.array(dims=['x', 'y'], values=np.random.rand(3, 2))
sc.show(ds)
ds

When inserting a data array or variable into a dataset, ownership is shared by default.
Use the `copy()` method to avoid this if undesirable:

In [None]:
ds['da_shared'] = da
ds['da_copied'] = da.copy()
da += 1000
ds

The usual `dict`-like methods are available for `Dataset`:

In [None]:
for name in ds:
    print(name)

In [None]:
'a' in ds

In [None]:
'e' in ds