1-D datasets and tables#

Multi-dimensional data arrays with labeled dimensions using scipp

scipp is heavily inspired by xarray. For many applications xarray is certainly more suitable (and definitely much more mature) than scipp, but in some situations it lacks features that scipp provides. If your use case requires one or more of the items on the following list, scipp may be worth considering:

Handling of physical units.
Propagation of uncertainties.
Support for histograms, i.e. bin-edge axes, which are one element longer than the data extent.
Support for event data, a particular form of sparse data with 1-D (or N-D) arrays of random-length lists, with very small list entries.
Written in C++ for better performance (for certain applications), in combination with Python bindings.
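
To make the bin-edge item concrete, here is a minimal sketch in plain numpy (not scipp). The sample values are made up for illustration. It only shows why a bin-edge axis has one more entry than the data it describes.

```python
import numpy as np

# Six made-up sample values sorted into three equal-width bins over [0, 3].
samples = np.array([0.1, 0.4, 1.2, 1.8, 2.5, 2.9])
counts, edges = np.histogram(samples, bins=3, range=(0.0, 3.0))

print(len(counts))  # number of bins
print(len(edges))   # number of bin boundaries: one more than the bins
```

Three bins need four boundaries, so an axis that stores edges cannot have the same length as the data.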

This notebook demonstrates key functionality and usage of the scipp library. See the documentation for more information.

Getting started#

What is a `Dataset`?#

The central data container in scipp is called a Dataset. There are two basic analogies to aid in thinking about a Dataset:

As a dict of numpy.ndarrays, with the addition of named dimensions and units.
As a table.
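
The first analogy can be sketched without scipp. The structure below is a hypothetical stand-in, not how scipp stores data: a plain dict whose entries carry a dimension label, a unit string and a numpy array. The column names and units are invented for illustration.

```python
import numpy as np

# Hypothetical stand-in for a 1-D dataset: each key is a column name,
# each value holds the dimension label, a unit string and the data.
table = {
    "length": {"dim": "row", "unit": "m", "values": np.array([1.0, 1.1, 1.2])},
    "time": {"dim": "row", "unit": "s", "values": np.array([2.0, 2.1, 2.2])},
}

# Column access is a dict lookup; row access indexes every column.
first_row = {name: col["values"][0] for name, col in table.items()}
print(first_row)
```

Reading the same structure column by column or row by row is what makes the second analogy, the table, work.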

Creating a dataset#

[1]:

import numpy as np
import scipp as sc

We start by creating an empty dataset:

[2]:

d = sc.Dataset()
d

[2]:

scipp.Dataset (0 Bytes)

Dimensions:
Data: (0)

Summary#

There are a number of ways to select and operate on a single row, a range of rows, a single variable (column) or multiple variables (columns) of a dataset:

[19]:

# Single row (dropping corresponding coordinates)
sc.table(d['row', 0])
# Size-1 row range (keeping corresponding coordinates)
sc.table(d['row', 0:1])
# Range of rows
sc.table(d['row', 1:3])
# Single column (column pair if variance is present) including coordinate columns
sc.table(d["alice"])
# Single variable (column pair if variance is present)
sc.table(d["alice"].data)
# Column values without header
sc.table(d["alice"].values)

Exercise 1#

Combining row slicing and “column” indexing, add the last row of the data for 'alice' to the first row of data for 'bob'.
Using the slice-range notation a:b, try adding the last two rows to the first two rows. Why does this fail?

Solution 1#

[20]:

d['bob']['row', 0] += d['alice']['row', -1]
sc.table(d)

If a range is given when slicing, the corresponding coordinate is preserved, and operations between misaligned data are prevented.

[21]:

try:
    d['bob']['row', 0:2] += d['alice']['row', 1:3]
except RuntimeError as e:
    print(str(e))

Mismatch in coordinate 'row' in operation 'add_equals':
(row: 2)      int64              [s]  [0, 1]
vs
(row: 2)      int64              [s]  [1, 2]
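
The idea behind this check can be pictured with plain numpy. The helper `add_aligned` below is hypothetical and not part of scipp. It is only a sketch, assuming that the check compares the coordinate values of both operands before modifying anything.

```python
import numpy as np

def add_aligned(target, target_coord, other, other_coord):
    """Add `other` to `target` in place, but only if the coordinates match."""
    if not np.array_equal(target_coord, other_coord):
        raise ValueError(f"Mismatch in coordinate: {target_coord} vs {other_coord}")
    target += other

row = np.arange(3)
alice = np.array([1.0, 2.0, 3.0])
bob = np.array([10.0, 20.0, 30.0])

# Slices 0:2 and 1:3 carry coordinates [0, 1] and [1, 2]: rejected.
try:
    add_aligned(bob[0:2], row[0:2], alice[1:3], row[1:3])
    rejected = False
except ValueError as e:
    rejected = True
    print(e)

# Matching coordinates [0, 1] on both sides: accepted, bob is updated.
add_aligned(bob[0:2], row[0:2], alice[0:2], row[0:2])
print(bob)
```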

To circumvent the safety catch we can operate on the underlying variables containing the data. The data is accessed using the data property:

[22]:

d['bob']['row', 0:2].data += d['alice']['row', 1:3].data
sc.table(d)

Exercise 2#

The slicing notation for variables (columns) and rows does not return a copy, but a view object. This is very similar to how numpy operates:

[23]:

a = np.arange(8)
a_slice = a[0:3]
a_slice += 100
a

[23]:

array([100, 101, 102,   3,   4,   5,   6,   7])

Using the slicing notation, create a new table (or replace the existing dataset d) that does not contain the first and last rows of d.

Solution 2#

[24]:

d2 = d['row', 1:-1].copy()

# Or:
# from copy import copy
# d2 = copy(d['row', 1:-1])

sc.table(d2)

Note that the call to copy() is essential. If it is omitted, we simply have a view onto the same data, and the original data is modified if the view is modified:

[25]:

just_a_view = d['row', 1:-1]
sc.to_html(just_a_view)
just_a_view['alice'].values[0] = 666
sc.table(d)

scipp.Dataset (40 Bytes out of 120 Bytes)
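
The same view-versus-copy distinction can be checked in plain numpy, which the slicing behaviour is modelled on. This is a sketch of the numpy side only; `np.shares_memory` reports whether two arrays use the same buffer.

```python
import numpy as np

a = np.arange(8)
view = a[1:-1]                 # basic slicing returns a view on the same buffer
independent = a[1:-1].copy()   # an explicit copy owns its own buffer

view[0] = 666                  # writes through to `a`
independent[1] = -1            # leaves `a` untouched

print(np.shares_memory(a, view))         # True
print(np.shares_memory(a, independent))  # False
print(a)
```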

Appending rows and columns#

We can append rows using concat, and add columns using merge:

[26]:

d = sc.concat([d['row', 0:3], d['row', 1:3]], 'row')

eve = sc.Dataset(data={'eve': sc.arange('row', 5.0)})
d = sc.merge(d, eve)

sc.table(d)
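
The new column needs five entries because concatenating rows 0:3 with rows 1:3 gives 3 + 2 = 5 rows. A minimal numpy sketch of the same slicing, using a made-up 3-row column:

```python
import numpy as np

column = np.array([10.0, 20.0, 30.0])  # made-up 3-row column

# Same slices as in the concat call: rows 0..2 followed by rows 1..2.
longer = np.concatenate([column[0:3], column[1:3]])
print(longer)
print(len(longer))  # 5, matching the five entries of the new column
```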

Exercise 3#

Add the sum of the data for alice and bob as a new variable (column) to the dataset.

Solution 3#

[27]:

d['sum'] = d['alice'] + d['bob']
sc.table(d)

Interaction with `numpy` and scalars#

Values (or variances) in a dataset are exposed in a numpy-compatible buffer format. Direct access to the numpy-like underlying data array is possible using the values and variances properties:

[28]:

d['eve'].values

[28]:

array([0., 1., 2., 3., 4.])

[29]:

d['alice'].variances

[29]:

array([0.434 , 0.394 , 0.1152, 0.394 , 0.1152])

We can directly hand the buffers to numpy functions:

[30]:

d['eve'].values = np.exp(d['eve'].values)
sc.table(d)

Exercise 4#

As above for np.exp applied to the data for Eve, apply a numpy function to the data for Alice.
What happens to the unit and uncertainties when modifying data with external code such as numpy?

Solution 4#

[31]:

d['alice'].values = np.sin(d['alice'].values)
sc.table(d)

Numpy operations are not aware of the unit and uncertainties. Therefore the result is “garbage” unless the user has handled units and uncertainties manually.

Corollary: Whenever available, built-in operators and functions should be preferred over the use of numpy: these will handle units and uncertainties for you.
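
To illustrate what handling uncertainties manually would involve, here is a sketch in plain numpy. It assumes first-order (linear) error propagation and made-up, dimensionless input values. It is not a description of scipp's internals.

```python
import numpy as np

# Made-up dimensionless values with their variances.
values = np.array([0.0, 0.5, 1.0])
variances = np.array([0.01, 0.04, 0.09])

# First-order propagation for f(x) = sin(x): var_f = (df/dx)^2 * var_x,
# with df/dx = cos(x) evaluated at the original values.
new_values = np.sin(values)
new_variances = np.cos(values) ** 2 * variances

print(new_values)
print(new_variances)
```

Applying np.sin to the values alone, as in the solution above, leaves the old variances in place even though they no longer describe the new values.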

Exercise 5#

Try adding a scalar value such as 1.5 to the values for 'eve' and 'alice'.
Try the same using the data property, instead of the values property. Why is it not working for 'alice'?

Solution 5#

[32]:

d['eve'].values += 1.5
d['alice'].values += 1.5
sc.table(d)

Instead of values we can use the data property. This will also correctly deal with variances, if applicable, whereas the direct operation with values is unaware of the presence of variances:

[33]:

d['eve'].data += 1.5

The data for Alice has a unit, so a direct addition with a dimensionless quantity fails:

[34]:

try:
    d['alice'].data += 1.5
except RuntimeError as e:
    print(str(e))

Cannot add m^2 and dimensionless.

We can use a scalar Variable to provide a quantity with an attached unit:

[35]:

scale = sc.scalar(1.5, unit=sc.units.m*sc.units.m)

As a short-hand for creating a scalar variable, just multiply a value by a unit:

[36]:

scale = 1.5 * (sc.units.m*sc.units.m)
d['alice'].data += scale

sc.table(d)
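
The unit check behind the failed addition can be pictured with a toy type. `Quantity` below is hypothetical and far simpler than scipp's unit handling: it stores the unit as a plain string and only supports addition.

```python
from dataclasses import dataclass

@dataclass
class Quantity:
    """Hypothetical toy type: a float with a unit string."""
    value: float
    unit: str

    def __add__(self, other):
        # Plain numbers count as dimensionless and never match a unit here.
        if not isinstance(other, Quantity) or other.unit != self.unit:
            other_unit = getattr(other, "unit", "dimensionless")
            raise ValueError(f"Cannot add {self.unit} and {other_unit}.")
        return Quantity(self.value + other.value, self.unit)

area = Quantity(2.0, "m^2")

try:
    area + 1.5                      # bare float: rejected
    failed = False
except ValueError as e:
    failed = True
    print(e)

total = area + Quantity(1.5, "m^2")  # matching units: accepted
print(total)
```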

Continue to Part 2 - Multi-dimensional datasets to see how datasets are used with multi-dimensional data.
