1-D datasets and tables#
Multi-dimensional data arrays with labeled dimensions using scipp.
scipp is heavily inspired by xarray. While for many applications xarray is certainly more suitable (and definitely much more mature) than scipp, in some situations it lacks features that scipp provides. If your use case requires one or several of the items on the following list, using scipp may be worth considering:
Handling of physical units.
Propagation of uncertainties.
Support for histograms, i.e., bin-edge axes, which are longer than the data extent by 1.
Support for event data, a particular form of sparse data with 1-D (or N-D) arrays of random-length lists, with very small list entries.
Written in C++ for better performance (for certain applications), in combination with Python bindings.
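The bin-edge convention above can be illustrated with plain numpy (this sketch uses np.histogram, not scipp): a histogram with N bins has N counts but N + 1 bin edges, which is exactly the off-by-one layout a bin-edge axis stores.

```python
import numpy as np

# np.histogram returns N counts and N + 1 bin edges -- the same
# off-by-one relationship that a bin-edge coordinate axis encodes.
samples = np.array([0.1, 0.4, 0.5, 0.9])
counts, edges = np.histogram(samples, bins=4, range=(0.0, 1.0))

print(counts)  # one count per bin
print(edges)   # one more edge than there are bins
assert len(edges) == len(counts) + 1
```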
This notebook demonstrates key functionality and usage of the scipp library. See the documentation for more information.
Getting started#
What is a Dataset?#
The central data container in scipp is called a Dataset. There are two basic analogies to aid in thinking about a Dataset:
As a dict of numpy.ndarray objects, with the addition of named dimensions and units.
As a table.
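A minimal numpy-only sketch of the first analogy: the dict gives same-length columns, but carries no dimension names, units, or variances, which is what a Dataset adds on top.

```python
import numpy as np

# A plain dict of numpy arrays: same-length "columns", but no named
# dimensions, units, or variances attached to them.
table = {
    'row': np.arange(3),                 # would be a coordinate in a Dataset
    'alice': np.array([1.0, 1.1, 1.2]),  # would be a data item with a unit
}
assert all(col.shape == (3,) for col in table.values())
```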
Creating a dataset#
[1]:
import numpy as np
import scipp as sc
We start by creating an empty dataset:
[2]:
d = sc.Dataset()
d
[2]:
Using Dataset
as a table#
We can think about, and indeed use a dataset as a table. This will demonstrate the basic ways of creating datasets and interacting with them. Columns can be added one by one:
[3]:
d['alice'] = sc.Variable(dims=['row'], values=[1.0,1.1,1.2],
variances=[0.01,0.01,0.02], unit=sc.units.m)
sc.table(d)
The column for 'alice' contains two sub-columns with values and associated variances (uncertainties). The uncertainties are optional.
The datatype (dtype) is derived from the provided data, so passing np.arange(3) will yield a variable (column) containing 64-bit integers.
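This mirrors numpy's own dtype rules, as the following numpy-only snippet shows:

```python
import numpy as np

# The dtype follows the data the container is constructed from:
assert np.arange(3).dtype.kind == 'i'            # integer input -> integer dtype
assert np.array([1.0, 1.1]).dtype == np.float64  # float input -> float64 dtype
```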
For many practical purposes we want to associate a set of values (optionally with a unit and variances) with our dimension. Let's introduce a coordinate for row so that we can assign a row number starting at zero.
[4]:
d.coords['row'] = sc.Variable(dims=['row'], values=np.arange(3))
sc.table(d)
Here the coord acts as a row header for the table. Note that the coordinate is itself just a variable.
More details of the dataset are visible in its string representation:
[5]:
d
[5]:
- row: 3
- row (row) int64 𝟙
  Values:
  array([0, 1, 2])
- alice (row) float64 m  σ = 0.1, 0.1, 0.141
  Values:
  array([1. , 1.1, 1.2])
  Variances (σ²):
  array([0.01, 0.01, 0.02])
A data item (column) in a dataset (table) is identified by its name ('alice'
). Note how each coordinate and data item is associated with named dimensions, in this case 'row'
, and also a shape:
[6]:
print(d.coords['row'].dims)
print(d.coords['row'].shape)
print(d['alice'].dims)
print(d['alice'].shape)
('row',)
(3,)
('row',)
(3,)
It is important to understand the difference between items in a dataset, the variable that holds the data of the item, and the actual values. The following illustrates the differences:
[7]:
sc.table(d['alice']) # includes coordinates
sc.table(d['alice'].data) # the variable holding the data, i.e., the dimension labels, units, values, and optional variances
sc.table(d['alice'].values) # just the array of values, shorthand for d['alice'].data.values
Each variable (column) comes with a physical unit, which we should set up correctly as early as possible:
[8]:
print(d.coords['row'].unit)
print(d['alice'].unit)
dimensionless
m
[9]:
d.coords['row'].unit = sc.units.s
sc.table(d)
Units and uncertainties are handled automatically in operations:
[10]:
d *= d
sc.table(d)
Note how the coordinate is unchanged by this operation. As a rule, operations compare coordinates (and fail if there is a mismatch).
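The variances shown after d *= d follow the standard first-order propagation rule for a product, σ_c² ≈ b²σ_a² + a²σ_b² (operands treated as independent). A numpy check reproduces the numbers in the table:

```python
import numpy as np

values = np.array([1.0, 1.1, 1.2])
variances = np.array([0.01, 0.01, 0.02])

# First-order propagation for c = a * b with independent operands:
# var_c = b**2 * var_a + a**2 * var_b.  For d *= d both operands are
# the same column, so this reduces to 2 * values**2 * variances.
var_out = 2.0 * values**2 * variances

print(var_out)  # [0.02, 0.0242, 0.0576], matching the table above
```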
Operations between columns are supported by indexing into a dataset with a name:
[11]:
d['bob'] = d['alice']
sc.table(d)
d
[11]:
- row: 3
- row (row) int64 s
  Values:
  array([0, 1, 2])
- alice (row) float64 m^2  σ = 0.141, 0.156, 0.24
  Values:
  array([1.  , 1.21, 1.44])
  Variances (σ²):
  array([0.02  , 0.0242, 0.0576])
- bob (row) float64 m^2  σ = 0.141, 0.156, 0.24
  Values:
  array([1.  , 1.21, 1.44])
  Variances (σ²):
  array([0.02  , 0.0242, 0.0576])
For small datasets, the show() function provides a quick graphical preview of the structure of a dataset:
[12]:
sc.show(d)
[13]:
d['bob'] += d['alice']
sc.table(d)
The contents of a dataset can also be displayed on a graph using the plot function:
[14]:
sc.plot(d)
This plot demonstrates the advantage of “labeled” data, provided by a dataset: Axes are automatically labeled and multiple items identified by their name are plotted. Furthermore, scipp’s support for units and uncertainties means that all relevant information is directly included in a default plot.
Operations between rows are supported by indexing into a dataset with a dimension label and an index.
Slicing dimensions behaves similarly to numpy: if a single index is given, the dimension is dropped; if a range is given, the dimension is kept. For a Dataset, in the former case the corresponding coordinates are dropped, whereas in the latter case they are preserved:
[15]:
a = np.arange(8)
[16]:
a[4]
[16]:
4
[17]:
a[4:5]
[17]:
array([4])
[18]:
d['row', 1] += d['row', 2]
sc.table(d)
Note the key advantage over numpy or MATLAB: we specify the index dimension, so we always know which dimension we are slicing. The advantage is not so apparent in 1-D, but will become clear once we move to higher-dimensional data.
Summary#
There are a number of ways to select and operate on a single row, a range of rows, a single variable (column), or multiple variables (columns) of a dataset:
[19]:
# Single row (dropping corresponding coordinates)
sc.table(d['row', 0])
# Size-1 row range (keeping corresponding coordinates)
sc.table(d['row', 0:1])
# Range of rows
sc.table(d['row', 1:3])
# Single column (column pair if variance is present) including coordinate columns
sc.table(d["alice"])
# Single variable (column pair if variance is present)
sc.table(d["alice"].data)
# Column values without header
sc.table(d["alice"].values)
Exercise 1#
Combining row slicing and “column” indexing, add the last row of the data for 'alice' to the first row of the data for 'bob'.
Using the slice-range notation a:b, try adding the last two rows to the first two rows. Why does this fail?
Solution 1#
[20]:
d['bob']['row', 0] += d['alice']['row', -1]
sc.table(d)
If a range is given when slicing, the corresponding coordinate is preserved, and operations between misaligned data are prevented.
[21]:
try:
d['bob']['row', 0:2] += d['alice']['row', 1:3]
except RuntimeError as e:
print(str(e))
Mismatch in coordinate 'row' in operation 'add_equals':
(row: 2) int64 [s] [0, 1]
vs
(row: 2) int64 [s] [1, 2]
To circumvent the safety catch we can operate on the underlying variables containing the data. The data is accessed using the data property:
[22]:
d['bob']['row', 0:2].data += d['alice']['row', 1:3].data
sc.table(d)
Exercise 2#
The slicing notation for variables (columns) and rows does not return a copy, but a view object. This is very similar to how numpy operates:
[23]:
a_slice = a[0:3]
a_slice += 100
a
[23]:
array([100, 101, 102, 3, 4, 5, 6, 7])
Using the slicing notation, create a new table (or replace the existing dataset d with one) that does not contain the first and last rows of d.
Solution 2#
[24]:
d2 = d['row', 1:-1].copy()
# Or:
# from copy import copy
# table = copy(d['row', 1:-1])
sc.table(d2)
Note that the call to copy() is essential. If it is omitted we simply have a view onto the same data, and the original data is modified if the view is modified:
[25]:
just_a_view = d['row', 1:-1]
sc.to_html(just_a_view)
just_a_view['alice'].values[0] = 666
sc.table(d)
- row: 1
- row (row) int64 s
  Values:
  array([1])
- alice (row) float64 m^2  σ = 0.628
  Values:
  array([11.06])
  Variances (σ²):
  array([0.394])
- bob (row) float64 m^2  σ = 0.628
  Values:
  array([11.06])
  Variances (σ²):
  array([0.394])
Appending rows and columns#
We can append rows using concat, and add columns using merge:
[26]:
d = sc.concat([d['row', 0:3], d['row', 1:3]], 'row')
eve = sc.Dataset(data={'eve': sc.arange('row', 5.0)})
d = sc.merge(d, eve)
sc.table(d)
Exercise 3#
Add the sum of the data for alice and bob as a new variable (column) to the dataset.
Solution 3#
[27]:
d['sum'] = d['alice'] + d['bob']
sc.table(d)
Interaction with numpy and scalars#
Values (or variances) in a dataset are exposed in a numpy-compatible buffer format. Direct access to the numpy-like underlying data array is possible using the values and variances properties:
[28]:
d['eve'].values
[28]:
array([0., 1., 2., 3., 4.])
[29]:
d['alice'].variances
[29]:
array([0.434 , 0.394 , 0.1152, 0.394 , 0.1152])
We can directly hand the buffers to numpy functions:
[30]:
d['eve'].values = np.exp(d['eve'].values)
sc.table(d)
Exercise 4#
As above for np.exp applied to the data for Eve, apply a numpy function to the data for Alice.
What happens to the unit and uncertainties when modifying data with external code such as numpy?
Solution 4#
[31]:
d['alice'].values = np.sin(d['alice'].values)
sc.table(d)
Numpy operations are not aware of the unit and uncertainties. Therefore the result is “garbage”, unless the user ensures that units and uncertainties are handled manually.
Corollary: whenever available, built-in operators and functions should be preferred over the use of numpy: these will handle units and uncertainties for you.
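To make the corollary concrete, here is a numpy-only sketch (not scipp's actual implementation) of the first-order propagation that a built-in sin would perform and that writing through .values skips: for y = sin(x), σ_y² ≈ cos(x)²·σ_x².

```python
import numpy as np

values = np.array([1.0, 1.1, 1.2])
variances = np.array([0.01, 0.01, 0.02])

# Writing through .values applies np.sin to the values only; the stored
# variances are left untouched and no longer describe the data.
new_values = np.sin(values)

# First-order propagation would also transform the variances:
# var_out = cos(x)**2 * var_in.
new_variances = np.cos(values) ** 2 * variances
assert new_variances.shape == variances.shape
```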
Exercise 5#
Try adding a scalar value such as 1.5 to the values for 'eve' or 'alice'.
Try the same using the data property, instead of the values property. Why is it not working for 'alice'?
Solution 5#
[32]:
d['eve'].values += 1.5
d['alice'].values += 1.5
sc.table(d)
Instead of values we can use the data property. This will also correctly deal with variances, if applicable, whereas the direct operation with values is unaware of the presence of variances:
[33]:
d['eve'].data += 1.5
The data for Alice has a unit, so a direct addition with a dimensionless quantity fails:
[34]:
try:
d['alice'].data += 1.5
except RuntimeError as e:
print(str(e))
Cannot add m^2 and dimensionless.
We can use a Variable to provide a scalar quantity with an attached unit:
[35]:
scale = sc.scalar(1.5, unit=sc.units.m*sc.units.m)
As a short-hand for creating a scalar variable, just multiply a value by a unit:
[36]:
scale = 1.5 * (sc.units.m*sc.units.m)
d['alice'].data += scale
sc.table(d)
Continue to Part 2 - Multi-dimensional datasets to see how datasets are used with multi-dimensional data.