{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 1-D datasets and tables\n", "\n", "**Multi-dimensional data arrays with labeled dimensions using scipp**\n", "\n", "`scipp` is heavily inspired by `xarray`. While for many applications `xarray` is certainly more suitable (and definitely much more matured) than `scipp`, there is a number of features missing in other situations. If your use case requires one or several of the items on the following list, using `scipp` may be worth considering:\n", "\n", "- Handling of physical units.\n", "- Propagation of uncertainties.\n", "- Support for histograms, i.e. bin-edge axes, which are by 1 longer than the data extent.\n", "- Support for event data, a particular form of sparse data with 1-D (or N-D) arrays of random-length lists, with very small list entries.\n", "- Written in C++ for better performance (for certain applications), in combination with Python bindings.\n", "\n", "This notebook demonstrates key functionality and usage of the `scipp` library. See the [documentation](https://scipp.github.io/) for more information.\n", "\n", "## Getting started\n", "### What is a `Dataset`?\n", "The central data container in `scipp` is called a `Dataset`.\n", "There are two basic analogies to aid in thinking about a `Dataset`:\n", "\n", "1. As a `dict` of `numpy.ndarray`s, with the addition of named dimensions and units.\n", "2. As a table.\n", "\n", "### Creating a dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import scipp as sc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by creating an empty dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d = sc.Dataset()\n", "d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Using `Dataset` as a table\n", "We can think about, and indeed use a dataset as a table.\n", "This will demonstrate the basic ways of creating datasets and interacting with them.\n", "Columns can be added one by one:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['alice'] = sc.Variable(dims=['row'], values=[1.0,1.1,1.2],\n", " variances=[0.01,0.01,0.02], unit=sc.units.m)\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The column for `'alice'` contains two sub-columns with values and associated variances (uncertainties).\n", "The uncertainties are optional.\n", "\n", "The datatype (`dtype`) is derived from the provided data, so passing `np.arange(3)` will yield a variable (column) containing 64-bit integers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For many practicle purposes we want to associate a set of values (optionally a unit and variances also) with our dimension. Lets introduce a coordinate for `row` so that we can assign a row number starting at zero." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d.coords['row'] = sc.Variable(dims=['row'], values=np.arange(3))\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the coord acts as a row header for the table. Note that the coordinate is itself just a variable." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More details of the dataset are visible in its string representation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A data item (column) in a dataset (table) is identified by its name (`'alice'`).\n", "Note how each coordinate and data item is associated with named dimensions, in this case `'row'`, and also a shape:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(d.coords['row'].dims)\n", "print(d.coords['row'].shape)\n", "print(d['alice'].dims)\n", "print(d['alice'].shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is important to understand the difference between items in a dataset, the variable that holds the data of the item, and the actual values.\n", "The following illustrates the differences:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sc.table(d['alice']) # includes coordinates\n", "sc.table(d['alice'].data) # the variable holding the data, i.e., the dimension labels, units, values, and optional variances\n", "sc.table(d['alice'].values) # just the array of values, shorthand for d['alice'].data.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each variable (column) comes with a physical unit, which we should set up correctly as early as possible:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(d.coords['row'].unit)\n", "print(d['alice'].unit)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d.coords['row'].unit = sc.units.s\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Units and uncertainties are handled automatically in operations:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d *= d\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how the coordinate is unchanged by this operations.\n", "As a rule, operations *compare* coordinates (and fail if there is a mismatch).\n", "\n", "Operations between columns are supported by indexing into a dataset with a name:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['bob'] = d['alice']\n", "sc.table(d)\n", "d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For small datasets, the `show()` function provides a quick graphical preview on the structure of a dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sc.show(d)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['bob'] += d['alice']\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The contents of a dataset can also be displayed on a graph using the `plot` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sc.plot(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This plot demonstrates the advantage of \"labeled\" data, provided by a dataset:\n", "Axes are automatically labeled and multiple items identified by their name are plotted.\n", "Furthermore, scipp's support for units and uncertainties means that all relevant information is directly included in a default plot.\n", "\n", "Operations between rows are supported by indexing into a dataset with a dimension label and an index.\n", "\n", "Slicing dimensions behaves similar to `numpy`:\n", "If a single index is given, the dimension is dropped, if a range is given, the dimension is kept.\n", "For a `Dataset`, in the former case the corresponding coordinates are dropped, whereas in the latter case it is preserved:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = np.arange(8)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[4]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[4:5]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['row', 1] += d['row', 2]\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the key advantage over `numpy` or `MATLAB`:\n", "We specify the index dimension, so we always know which dimension we are slicing.\n", "The advantage is not so apparent in 1D, but will become clear once we move to higher-dimensional data.\n", "\n", "### Summary\n", "\n", "There is a number of ways to select and operate on a single row, a range of rows, a single variable (column) or multiple variables (columns) of a dataset: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "code_folding": [] }, "outputs": [], "source": [ "# Single row (dropping corresponding coordinates)\n", "sc.table(d['row', 0])\n", "# Size-1 row range (keeping corresponding coordinates)\n", "sc.table(d['row', 0:1])\n", "# Range of rows\n", "sc.table(d['row', 1:3])\n", "# Single column (column pair if variance is present) including coordinate columns\n", "sc.table(d[\"alice\"])\n", "# Single variable (column pair if variance is present)\n", "sc.table(d[\"alice\"].data)\n", "# Column values without header\n", "sc.table(d[\"alice\"].values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1\n", "1. Combining row slicing and \"column\" indexing, add the last row of the data for `'alice'` to the first row of data for `'bob'`.\n", "2. Using the slice-range notation `a:b`, try adding the last two rows to the first two rows. Why does this fail?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['bob']['row', 0] += d['alice']['row', -1]\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a range is given when slicing, the corresponding coordinate is preserved, and operations between misaligned data is prevented." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " d['bob']['row', 0:2] += d['alice']['row', 1:3]\n", "except RuntimeError as e:\n", " print(str(e))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To circumvent the safety catch we can operate on the underlying variables containing the data.\n", "The data is accessed using the `data` property:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['bob']['row', 0:2].data += d['alice']['row', 1:3].data\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "\n", "The slicing notation for variables (columns) and rows does not return a copy, but a view object.\n", "This is very similar to how `numpy` operates:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a_slice = a[0:3]\n", "a_slice += 100\n", "a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the slicing notation, create a new table (or replace the existing dataset `d`) by one that does not contain the first and last row of `d`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d2 = d['row', 1:-1].copy()\n", "\n", "# Or:\n", "# from copy import copy\n", "# table = copy(d['row', 1:-1])\n", "\n", "sc.table(d2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the call to `copy()` is essential.\n", "If it is omitted we simply have a view onto the same data, and the orignal data is modified if the view is modified:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "just_a_view = d['row', 1:-1]\n", "sc.to_html(just_a_view)\n", "just_a_view['alice'].values[0] = 666\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Appending rows and columns\n", "We can append rows using `concatenate`, and add columns using `merge`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d = sc.concatenate(d['row', 0:3], d['row', 1:3], 'row')\n", "\n", "eve = sc.Dataset(data={'eve': sc.arange('row', 5.0)})\n", "d = sc.merge(d, eve)\n", "\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3\n", "Add the sum of the data for `alice` and `bob` as a new variable (column) to the dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution 3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['sum'] = d['alice'] + d['bob']\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interaction with `numpy` and scalars\n", "\n", "Values (or variances) in a dataset are exposed in a `numpy`-compatible buffer format.\n", "Direct access to the `numpy`-like underlying data array is possible using the `values` and `variances` properties:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['eve'].values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['alice'].variances" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can directly hand the buffers to `numpy` functions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['eve'].values = np.exp(d['eve'].values)\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 4\n", "1. As above for `np.exp` applied to the data for Eve, apply a `numpy` function to the data for Alice.\n", "2. What happens to the unit and uncertanties when modifying data with external code such as `numpy`?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution 4" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['alice'].values = np.sin(d['alice'].values)\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Numpy operations are not aware of the unit and uncertainties. Therefore the result is \"garbage\", unless the user has ensured herself that units and uncertainties are handled manually.\n", "\n", "Corollary: Whenever available, built-in operators and functions should be preferred over the use of `numpy`: these will handle units and uncertanties for you." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 5\n", "1. Try adding a scalar value such as `1.5` to the `values` for `'eve'` or and `'alice'`.\n", "2. Try the same using the `data` property, instead of the `values` property.\n", " Why is it not working for `'alice'`?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Solution 5" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['eve'].values += 1.5\n", "d['alice'].values += 1.5\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of `values` we can use the `data` property.\n", "This will also correctly deal with variances, if applicable, whereas the direction operation with `values` is unaware of the presence of variances:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d['eve'].data += 1.5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `data` for Alice has a unit, so a direct addition with a dimensionless quantity fails:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " d['alice'].data += 1.5\n", "except RuntimeError as e:\n", " print(str(e))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `Variable` to provide a scalar quantity with attached unit:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scale = sc.scalar(1.5, unit=sc.units.m*sc.units.m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a short-hand for creating a scalar variable, just multiply a value by a unit:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scale = 1.5 * (sc.units.m*sc.units.m)\n", "d['alice'].data += scale\n", "\n", "sc.table(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Continue to [Part 2 - Multi-dimensional datasets](multi-d-datasets.ipynb) to see how datasets are used with multi-dimensional data." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }