1-D datasets and tables#

Multi-dimensional data arrays with labeled dimensions using scipp

scipp is heavily inspired by xarray. While xarray is certainly more suitable (and definitely much more mature) than scipp for many applications, it lacks a number of features that matter in other situations. If your use case requires one or several of the items on the following list, using scipp may be worth considering:

  • Handling of physical units.

  • Propagation of uncertainties.

  • Support for histograms, i.e. bin-edge axes, which are one element longer than the data extent.

  • Support for event data, a particular form of sparse data with 1-D (or N-D) arrays of random-length lists, with very small list entries.

  • Written in C++ for better performance (for certain applications), in combination with Python bindings.
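
To make the bin-edge item concrete, here is a minimal sketch using plain numpy rather than scipp: a histogram with N bins is described by N+1 edges, so the edge axis is one element longer than the counts. The sample values are made up for illustration.

```python
import numpy as np

# Made-up sample values, for illustration only.
samples = np.array([0.1, 0.4, 1.2, 1.7, 2.5])

# Three bins need four edges: [0, 1), [1, 2), [2, 3].
counts, edges = np.histogram(samples, bins=3, range=(0.0, 3.0))

print(counts)  # one count per bin
print(edges)   # one more edge than there are bins
```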

This notebook demonstrates key functionality and usage of the scipp library. See the documentation for more information.

Getting started#

What is a Dataset?#

The central data container in scipp is called a Dataset. There are two basic analogies to aid in thinking about a Dataset:

  1. As a dict of numpy.ndarrays, with the addition of named dimensions and units.

  2. As a table.
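
The two analogies can be pictured with plain numpy. This is only a rough sketch: the dict layout and the values for 'bob' are made up for illustration and do not reflect how scipp stores data internally.

```python
import numpy as np

# Analogy 1: a dict of arrays, with dimension names and units kept alongside.
# The layout of this dict is purely illustrative, not scipp's internals.
columns = {
    "alice": {"dims": ("row",), "unit": "m", "values": np.array([1.0, 1.1, 1.2])},
    "bob": {"dims": ("row",), "unit": "m", "values": np.array([2.0, 2.2, 2.4])},
}

# Analogy 2: the same data read as a table, one row at a time.
rows = list(zip(*(col["values"].tolist() for col in columns.values())))
for row in rows:
    print(row)
```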

Creating a dataset#

[1]:
import numpy as np
import scipp as sc

We start by creating an empty dataset:

[2]:
d = sc.Dataset()
d
[2]:
scipp.Dataset (0 Bytes)

    Using Dataset as a table#

    We can think of, and indeed use, a dataset as a table. This will demonstrate the basic ways of creating datasets and interacting with them. Columns can be added one by one:

    [3]:
    
    d['alice'] = sc.Variable(dims=['row'], values=[1.0,1.1,1.2],
                             variances=[0.01,0.01,0.02], unit=sc.units.m)
    sc.table(d)
    

    The column for 'alice' contains two sub-columns with values and associated variances (uncertainties). The uncertainties are optional.

    The datatype (dtype) is derived from the provided data, so passing np.arange(3) will yield a variable (column) containing 64-bit integers.
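
The dtype rule mirrors numpy's own inference, which can be checked without scipp. A small sketch; note that the default integer width is platform dependent, so the second result is int64 on most 64-bit platforms but not guaranteed everywhere.

```python
import numpy as np

floats = np.array([1.0, 1.1, 1.2])  # Python floats become float64
ints = np.arange(3)                 # integer range: platform default integer

print(floats.dtype)
print(ints.dtype)  # int64 on most 64-bit platforms
```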

    For many practical purposes we want to associate a set of values (and optionally a unit and variances) with our dimension. Let's introduce a coordinate for row so that we can assign a row number starting at zero.

    [4]:
    
    d.coords['row'] = sc.Variable(dims=['row'], values=np.arange(3))
    sc.table(d)
    

    Here the coord acts as a row header for the table. Note that the coordinate is itself just a variable.

    More details of the dataset are visible in its string representation:

    [5]:
    
    d
    
    [5]:
    
    scipp.Dataset (72 Bytes)
      • row: 3
      • row
        (row)
        int64
        𝟙
        0, 1, 2
        Values:
        array([0, 1, 2])
      • alice
        (row)
        float64
        m
        1.0, 1.1, 1.2
        σ = 0.1, 0.1, 0.141
        Values:
        array([1. , 1.1, 1.2])

        Variances (σ²):
        array([0.01, 0.01, 0.02])
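
The σ line in the representation above is the square root of each variance. A quick check in plain Python, using the variances that were passed in when creating 'alice':

```python
import math

variances = [0.01, 0.01, 0.02]

# Standard deviation is the square root of the variance.
sigmas = [math.sqrt(v) for v in variances]

print([round(s, 3) for s in sigmas])
```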

    A data item (column) in a dataset (table) is identified by its name ('alice'). Note how each coordinate and data item is associated with named dimensions, in this case 'row', and also a shape:

    [6]:
    
    print(d.coords['row'].dims)
    print(d.coords['row'].shape)
    print(d['alice'].dims)
    print(d['alice'].shape)
    
    ['row']
    [3]
    ['row']
    [3]
    

    It is important to understand the difference between items in a dataset, the variable that holds the data of the item, and the actual values. The following illustrates the differences:

    [7]:
    
    sc.table(d['alice']) # includes coordinates
    sc.table(d['alice'].data) # the variable holding the data, i.e., the dimension labels, units, values, and optional variances
    sc.table(d['alice'].values) # just the array of values, shorthand for d['alice'].data.values
    

    Each variable (column) comes with a physical unit, which we should set up correctly as early as possible:

    [8]:
    
    print(d.coords['row'].unit)
    print(d['alice'].unit)
    
    dimensionless
    m
    
    [9]:
    
    d.coords['row'].unit = sc.units.s
    sc.table(d)
    

    Units and uncertainties are handled automatically in operations:

    [10]:
    
    d *= d
    sc.table(d)
    

    Note how the coordinate is unchanged by this operation. As a rule, operations compare coordinates (and fail if there is a mismatch).

    Operations between columns are supported by indexing into a dataset with a name:

    [11]:
    
    d['bob'] = d['alice']
    sc.table(d)
    d
    
    [11]:
    
    scipp.Dataset (120 Bytes)
      • row: 3
      • row
        (row)
        int64
        s
        0, 1, 2
        Values:
        array([0, 1, 2])
      • alice
        (row)
        float64
        m^2
        1.0, 1.210, 1.44
        σ = 0.141, 0.156, 0.24
        Values:
        array([1. , 1.21, 1.44])

        Variances (σ²):
        array([0.02 , 0.0242, 0.0576])
      • bob
        (row)
        float64
        m^2
        1.0, 1.210, 1.44
        σ = 0.141, 0.156, 0.24
        Values:
        array([1. , 1.21, 1.44])

        Variances (σ²):
        array([0.02 , 0.0242, 0.0576])
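
The variances printed above are consistent with first-order error propagation for a product whose two operands are treated as uncorrelated, var(a·b) = var(a)·b² + var(b)·a². The plain-Python sketch below is an assumption about the rule, not a statement of scipp's implementation; the helper `multiply` is hypothetical. It starts from the original 'alice' values and reproduces the numbers shown after d *= d.

```python
# Original 'alice' column, before d *= d.
values = [1.0, 1.1, 1.2]
variances = [0.01, 0.01, 0.02]


def multiply(a, var_a, b, var_b):
    """First-order propagation for a product, treating a and b as uncorrelated."""
    return a * b, var_a * b**2 + var_b * a**2


# Multiply every entry by itself, as d *= d does for each column.
squared = [multiply(x, v, x, v) for x, v in zip(values, variances)]

for value, variance in squared:
    print(round(value, 4), round(variance, 4))
```

The unit follows the same operation: m multiplied by m gives the m^2 shown in the representation.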

    For small datasets, the show() function provides a quick graphical preview of the structure of a dataset:

    [12]:
    
    sc.show(d)
    
    [Figure: output of sc.show(d), showing the items bob(dims=['row'], shape=[3], unit=m^2, variances=True) and alice(dims=['row'], shape=[3], unit=m^2, variances=True), and the coordinate row(dims=['row'], shape=[3], unit=s, variances=False).]
    [13]:
    
    d['bob'] += d['alice']
    sc.table(d)
    

    The contents of a dataset can also be displayed on a graph using the plot function:

    [14]:
    
    sc.plot(d)
    

    This plot demonstrates the advantage of “labeled” data, provided by a dataset: Axes are automatically labeled and multiple items identified by their name are plotted. Furthermore, scipp’s support for units and uncertainties means that all relevant information is directly included in a default plot.

    Operations between rows are supported by indexing into a dataset with a dimension label and an index.

    Slicing dimensions behaves similarly to numpy: if a single index is given, the dimension is dropped; if a range is given, the dimension is kept. For a Dataset, in the former case the corresponding coordinates are dropped, whereas in the latter case they are preserved:

    [15]:
    
    a = np.arange(8)
    
    [16]:
    
    a[4]
    
    [16]:
    
    4
    
    [17]:
    
    a[4:5]
    
    [17]:
    
    array([4])
    
    [18]:
    
    d['row', 1] += d['row', 2]
    sc.table(d)
    

    Note the key advantage over numpy or MATLAB: We specify the index dimension, so we always know which dimension we are slicing. The advantage is not so apparent in 1D, but will become clear once we move to higher-dimensional data.
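
The difference can be sketched with a 2-D numpy array. The dimension names and the helper `slice_by_name` below are hypothetical and only mimic the idea of naming the dimension being sliced; they are not part of numpy or scipp.

```python
import numpy as np

data = np.arange(6).reshape(2, 3)
dims = ("y", "x")  # hypothetical dimension names for the two axes


def slice_by_name(array, dims, dim, index):
    """Slice `array` along the axis named `dim` (illustrative helper)."""
    key = [slice(None)] * array.ndim
    key[dims.index(dim)] = index
    return array[tuple(key)]


print(data[1])                            # positional: the reader must know the axis order
print(slice_by_name(data, dims, "x", 1))  # named: explicitly the 'x' axis
```

As with the 1-D examples above, passing a range such as slice(1, 2) instead of a single index keeps the dimension.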

    Summary#

    There are a number of ways to select and operate on a single row, a range of rows, a single variable (column) or multiple variables (columns) of a dataset:

    [19]:
    
    # Single row (dropping corresponding coordinates)
    sc.table(d['row', 0])
    # Size-1 row range (keeping corresponding coordinates)
    sc.table(d['row', 0:1])
    # Range of rows
    sc.table(d['row', 1:3])
    # Single column (column pair if variance is present) including coordinate columns
    sc.table(d["alice"])
    # Single variable (column pair if variance is present)
    sc.table(d["alice"].data)
    # Column values without header
    sc.table(d["alice"].values)
    

    Exercise 1#

    1. Combining row slicing and “column” indexing, add the last row of the data for 'alice' to the first row of data for 'bob'.

    2. Using the slice-range notation a:b, try adding the last two rows to the first two rows. Why does this fail?

    Solution 1#

    [20]:
    
    d['bob']['row', 0] += d['alice']['row', -1]
    sc.table(d)
    

    If a range is given when slicing, the corresponding coordinate is preserved, and operations between misaligned data are prevented.

    [21]:
    
    try:
        d['bob']['row', 0:2] += d['alice']['row', 1:3]
    except RuntimeError as e:
        print(str(e))
    
    Mismatch in coordinate 'row' in operation 'add_equals':
    (row: 2)      int64              [s]  [0, 1]
    vs
    (row: 2)      int64              [s]  [1, 2]
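
The check behind this error can be pictured as follows. This is a simplified numpy sketch with a hypothetical helper `add_aligned` and made-up values, not scipp's implementation: the in-place addition proceeds only when the coordinates of both operands agree.

```python
import numpy as np


def add_aligned(values, coord, other_values, other_coord):
    """Add in place only if both operands share the same coordinate values."""
    if not np.array_equal(coord, other_coord):
        raise RuntimeError(f"Mismatch in coordinate: {coord} vs {other_coord}")
    values += other_values


row = np.arange(3)
bob = np.array([1.0, 2.0, 3.0])       # made-up values
alice = np.array([10.0, 20.0, 30.0])  # made-up values

try:
    # Rows 0:2 of 'bob' and rows 1:3 of 'alice' carry different coordinates.
    add_aligned(bob[0:2], row[0:2], alice[1:3], row[1:3])
except RuntimeError as e:
    print(e)
```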
    

    To circumvent the safety catch we can operate on the underlying variables containing the data. The data is accessed using the data property:

    [22]:
    
    d['bob']['row', 0:2].data += d['alice']['row', 1:3].data
    sc.table(d)
    

    Exercise 2#

    The slicing notation for variables (columns) and rows does not return a copy, but a view object. This is very similar to how numpy operates:

    [23]:
    
    a_slice = a[0:3]
    a_slice += 100
    a
    
    [23]:
    
    array([100, 101, 102,   3,   4,   5,   6,   7])
    

    Using the slicing notation, create a new table (or replace the existing dataset d) that does not contain the first and last rows of d.

    Solution 2#

    [24]:
    
    d2 = d['row', 1:-1].copy()
    
    # Or:
    # from copy import copy
    # d2 = copy(d['row', 1:-1])
    
    sc.table(d2)
    

    Note that the call to copy() is essential. If it is omitted we simply have a view onto the same data, and the original data is modified if the view is modified:

    [25]:
    
    just_a_view = d['row', 1:-1]
    sc.to_html(just_a_view)
    just_a_view['alice'].values[0] = 666
    sc.table(d)
    
    scipp.Dataset (40 Bytes out of 120 Bytes)
      • row: 1
      • row
        (row)
        int64
        s
        1
        Values:
        array([1])
      • alice
        (row)
        float64
        m^2
        11.060
        σ = 0.628
        Values:
        array([11.06])

        Variances (σ²):
        array([0.394])
      • bob
        (row)
        float64
        m^2
        11.060
        σ = 0.628
        Values:
        array([11.06])

        Variances (σ²):
        array([0.394])
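
The same distinction between a view and a copy can be reproduced with numpy alone, which may help when reasoning about the scipp behavior above. A minimal sketch:

```python
import numpy as np

a = np.arange(8)

view = a[1:-1]                # shares memory with `a`
independent = a[1:-1].copy()  # owns its data

independent[0] = 666          # leaves `a` untouched
print(a[1])

view[0] = 666                 # writes through to `a`
print(a[1])

print(np.shares_memory(a, view), np.shares_memory(a, independent))
```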

    Appending rows and columns#

    We can append rows using concat, and add columns using merge:

    [26]:
    
    d = sc.concat([d['row', 0:3], d['row', 1:3]], 'row')
    
    eve = sc.Dataset(data={'eve': sc.arange('row', 5.0)})
    d = sc.merge(d, eve)
    
    sc.table(d)
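
The lengths work out as follows: rows 0:3 contribute three rows and rows 1:3 two more, so the concatenated dataset has five rows, which is why 'eve' is created with five values. A numpy analogue of the row bookkeeping (not a scipp call):

```python
import numpy as np

rows = np.arange(3)

# Rows 0:3 followed by rows 1:3, analogous to the concat call above.
combined = np.concatenate([rows[0:3], rows[1:3]])

print(combined)
print(len(combined))  # matches the length of the 'eve' column
```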
    

    Exercise 3#

    Add the sum of the data for alice and bob as a new variable (column) to the dataset.

    Solution 3#

    [27]:
    
    d['sum'] = d['alice'] + d['bob']
    sc.table(d)
    

    Interaction with numpy and scalars#

    Values (or variances) in a dataset are exposed in a numpy-compatible buffer format. Direct access to the numpy-like underlying data array is possible using the values and variances properties:

    [28]:
    
    d['eve'].values
    
    [28]:
    
    array([0., 1., 2., 3., 4.])
    
    [29]:
    
    d['alice'].variances
    
    [29]:
    
    array([0.434 , 0.394 , 0.1152, 0.394 , 0.1152])
    

    We can directly hand the buffers to numpy functions:

    [30]:
    
    d['eve'].values = np.exp(d['eve'].values)
    sc.table(d)
    

    Exercise 4#

    1. As above for np.exp applied to the data for Eve, apply a numpy function to the data for Alice.

    2. What happens to the unit and uncertainties when modifying data with external code such as numpy?

    Solution 4#

    [31]:
    
    d['alice'].values = np.sin(d['alice'].values)
    sc.table(d)
    

    Numpy operations are not aware of the unit and uncertainties. Therefore the result is “garbage”, unless the user has made sure that units and uncertainties are handled manually.

    Corollary: Whenever available, built-in operators and functions should be preferred over the use of numpy: these will handle units and uncertainties for you.
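
To illustrate the point for uncertainties, the sketch below applies np.sin by hand to made-up values. Standard first-order propagation would scale each variance by cos²(x), whereas assigning the result to values leaves the stored variances as they were. The propagation rule used here is the textbook first-order formula, not a quote of scipp's implementation. The unit is a separate problem: the sine of a quantity in m^2 has no physical meaning.

```python
import numpy as np

values = np.array([1.0, 1.21, 1.44])          # made-up inputs
variances = np.array([0.02, 0.0242, 0.0576])  # made-up variances

# What assigning np.sin(values) to `values` does: variances stay as they were.
naive_values = np.sin(values)
naive_variances = variances

# First-order propagation for f(x) = sin(x): var_f = cos(x)**2 * var_x.
propagated_variances = np.cos(values) ** 2 * variances

print(naive_variances)
print(propagated_variances)
```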

    Exercise 5#

    1. Try adding a scalar value such as 1.5 to the values for 'eve' and 'alice'.

    2. Try the same using the data property, instead of the values property. Why is it not working for 'alice'?

    Solution 5#

    [32]:
    
    d['eve'].values += 1.5
    d['alice'].values += 1.5
    sc.table(d)
    

    Instead of values we can use the data property. This will also correctly deal with variances, if applicable, whereas the direct operation on values is unaware of the presence of variances:

    [33]:
    
    d['eve'].data += 1.5
    

    The data for Alice has a unit, so a direct addition with a dimensionless quantity fails:

    [34]:
    
    try:
        d['alice'].data += 1.5
    except RuntimeError as e:
        print(str(e))
    
    Cannot add m^2 and dimensionless.
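
Why the addition is rejected can be shown with a toy model. The `Quantity` class below is hypothetical and tracks only the exponent of metres; it is a sketch of the idea of unit checking, not how scipp represents units.

```python
class Quantity:
    """Toy quantity: a value plus the exponent of metres (0 = dimensionless)."""

    def __init__(self, value, metre_exponent=0):
        self.value = value
        self.metre_exponent = metre_exponent

    def __add__(self, other):
        # Addition requires identical units.
        if self.metre_exponent != other.metre_exponent:
            raise RuntimeError("Cannot add quantities with different units.")
        return Quantity(self.value + other.value, self.metre_exponent)

    def __mul__(self, other):
        # Multiplication adds exponents: m * m = m^2.
        return Quantity(self.value * other.value,
                        self.metre_exponent + other.metre_exponent)


area = Quantity(1.2, 1) * Quantity(1.2, 1)  # unit m^2
try:
    area + Quantity(1.5)                    # dimensionless operand: rejected
except RuntimeError as e:
    print(e)

total = area + Quantity(1.5, 2)             # same unit: accepted
print(total.value, total.metre_exponent)
```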
    

    We can use sc.scalar to provide a scalar quantity with an attached unit:

    [35]:
    
    scale = sc.scalar(1.5, unit=sc.units.m*sc.units.m)
    

    As a shorthand for creating a scalar variable, just multiply a value by a unit:

    [36]:
    
    scale = 1.5 * (sc.units.m*sc.units.m)
    d['alice'].data += scale
    
    sc.table(d)
    

    Continue to Part 2 - Multi-dimensional datasets to see how datasets are used with multi-dimensional data.