{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1-D datasets and tables\n",
    "\n",
    "**Multi-dimensional data arrays with labeled dimensions using scipp**\n",
    "\n",
    "`scipp` is heavily inspired by `xarray`. While for many applications `xarray` is certainly more suitable (and definitely much more matured) than `scipp`, there is a number of features missing in other situations. If your use case requires one or several of the items on the following list, using `scipp` may be worth considering:\n",
    "\n",
    "- Handling of physical units.\n",
    "- Propagation of uncertainties.\n",
    "- Support for histograms, i.e. bin-edge axes, which are by 1 longer than the data extent.\n",
    "- Support for event data, a particular form of sparse data with 1-D (or N-D) arrays of random-length lists, with very small list entries.\n",
    "- Written in C++ for better performance (for certain applications), in combination with Python bindings.\n",
    "\n",
    "This notebook demonstrates key functionality and usage of the `scipp` library. See the [documentation](https://scipp.github.io/) for more information.\n",
    "\n",
    "## Getting started\n",
    "### What is a `Dataset`?\n",
    "The central data container in `scipp` is called a `Dataset`.\n",
    "There are two basic analogies to aid in thinking about a `Dataset`:\n",
    "\n",
    "1. As a `dict` of `numpy.ndarray`s, with the addition of named dimensions and units.\n",
    "2. As a table.\n",
    "\n",
    "### Creating a dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import scipp as sc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start by creating an empty dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = sc.Dataset()\n",
    "d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using `Dataset` as a table\n",
    "We can think about, and indeed use a dataset as a table.\n",
    "This will demonstrate the basic ways of creating datasets and interacting with them.\n",
    "Columns can be added one by one:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['alice'] = sc.Variable(dims=['row'], values=[1.0,1.1,1.2],\n",
    "                         variances=[0.01,0.01,0.02], unit=sc.units.m)\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column for `'alice'` contains two sub-columns with values and associated variances (uncertainties).\n",
    "The uncertainties are optional.\n",
    "\n",
    "The datatype (`dtype`) is derived from the provided data, so passing `np.arange(3)` will yield a variable (column) containing 64-bit integers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For many practicle purposes we want to associate a set of values (optionally a unit and variances also) with our dimension. Lets introduce a coordinate for `row` so that we can assign a row number starting at zero."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.coords['row'] = sc.Variable(dims=['row'], values=np.arange(3))\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here the coord acts as a row header for the table. Note that the coordinate is itself just a variable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More details of the dataset are visible in its string representation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A data item (column) in a dataset (table) is identified by its name (`'alice'`).\n",
    "Note how each coordinate and data item is associated with named dimensions, in this case `'row'`, and also a shape:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(d.coords['row'].dims)\n",
    "print(d.coords['row'].shape)\n",
    "print(d['alice'].dims)\n",
    "print(d['alice'].shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is important to understand the difference between items in a dataset, the variable that holds the data of the item, and the actual values.\n",
    "The following illustrates the differences:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.table(d['alice']) # includes coordinates\n",
    "sc.table(d['alice'].data) # the variable holding the data, i.e., the dimension labels, units, values, and optional variances\n",
    "sc.table(d['alice'].values) # just the array of values, shorthand for d['alice'].data.values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each variable (column) comes with a physical unit, which we should set up correctly as early as possible:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(d.coords['row'].unit)\n",
    "print(d['alice'].unit)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d.coords['row'].unit = sc.units.s\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Units and uncertainties are handled automatically in operations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d *= d\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how the coordinate is unchanged by this operations.\n",
    "As a rule, operations *compare* coordinates (and fail if there is a mismatch).\n",
    "\n",
    "Operations between columns are supported by indexing into a dataset with a name:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['bob'] = d['alice']\n",
    "sc.table(d)\n",
    "d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For small datasets, the `show()` function provides a quick graphical preview on the structure of a dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.show(d)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['bob'] += d['alice']\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The contents of a dataset can also be displayed on a graph using the `plot` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.plot(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This plot demonstrates the advantage of \"labeled\" data, provided by a dataset:\n",
    "Axes are automatically labeled and multiple items identified by their name are plotted.\n",
    "Furthermore, scipp's support for units and uncertainties means that all relevant information is directly included in a default plot.\n",
    "\n",
    "Operations between rows are supported by indexing into a dataset with a dimension label and an index.\n",
    "\n",
    "Slicing dimensions behaves similar to `numpy`:\n",
    "If a single index is given, the dimension is dropped, if a range is given, the dimension is kept.\n",
    "For a `Dataset`, in the former case the corresponding coordinates are dropped, whereas in the latter case it is preserved:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = np.arange(8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[4:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['row', 1] += d['row', 2]\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note the key advantage over `numpy` or `MATLAB`:\n",
    "We specify the index dimension, so we always know which dimension we are slicing.\n",
    "The advantage is not so apparent in 1D, but will become clear once we move to higher-dimensional data.\n",
    "\n",
    "### Summary\n",
    "\n",
    "There is a number of ways to select and operate on a single row, a range of rows, a single variable (column) or multiple variables (columns) of a dataset: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "code_folding": []
   },
   "outputs": [],
   "source": [
    "# Single row (dropping corresponding coordinates)\n",
    "sc.table(d['row', 0])\n",
    "# Size-1 row range (keeping corresponding coordinates)\n",
    "sc.table(d['row', 0:1])\n",
    "# Range of rows\n",
    "sc.table(d['row', 1:3])\n",
    "# Single column (column pair if variance is present) including coordinate columns\n",
    "sc.table(d[\"alice\"])\n",
    "# Single variable (column pair if variance is present)\n",
    "sc.table(d[\"alice\"].data)\n",
    "# Column values without header\n",
    "sc.table(d[\"alice\"].values)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 1\n",
    "1. Combining row slicing and \"column\" indexing, add the last row of the data for `'alice'` to the first row of data for `'bob'`.\n",
    "2. Using the slice-range notation `a:b`, try adding the last two rows to the first two rows. Why does this fail?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['bob']['row', 0] += d['alice']['row', -1]\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If a range is given when slicing, the corresponding coordinate is preserved, and operations between misaligned data is prevented."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    d['bob']['row', 0:2] += d['alice']['row', 1:3]\n",
    "except RuntimeError as e:\n",
    "    print(str(e))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To circumvent the safety catch we can operate on the underlying variables containing the data.\n",
    "The data is accessed using the `data` property:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['bob']['row', 0:2].data += d['alice']['row', 1:3].data\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 2\n",
    "\n",
    "The slicing notation for variables (columns) and rows does not return a copy, but a view object.\n",
    "This is very similar to how `numpy` operates:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a_slice = a[0:3]\n",
    "a_slice += 100\n",
    "a"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the slicing notation, create a new table (or replace the existing dataset `d`) by one that does not contain the first and last row of `d`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d2 = d['row', 1:-1].copy()\n",
    "\n",
    "# Or:\n",
    "# from copy import copy\n",
    "# table = copy(d['row', 1:-1])\n",
    "\n",
    "sc.table(d2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the call to `copy()` is essential.\n",
    "If it is omitted we simply have a view onto the same data, and the orignal data is modified if the view is modified:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "just_a_view = d['row', 1:-1]\n",
    "sc.to_html(just_a_view)\n",
    "just_a_view['alice'].values[0] = 666\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Appending rows and columns\n",
    "We can append rows using `concatenate`, and add columns using `merge`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = sc.concatenate(d['row', 0:3], d['row', 1:3], 'row')\n",
    "\n",
    "eve = sc.Dataset(data={'eve': sc.arange('row', 5.0)})\n",
    "d = sc.merge(d, eve)\n",
    "\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 3\n",
    "Add the sum of the data for `alice` and `bob` as a new variable (column) to the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['sum'] = d['alice'] + d['bob']\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interaction with `numpy` and scalars\n",
    "\n",
    "Values (or variances) in a dataset are exposed in a `numpy`-compatible buffer format.\n",
    "Direct access to the `numpy`-like underlying data array is possible using the `values` and `variances` properties:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['eve'].values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['alice'].variances"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can directly hand the buffers to `numpy` functions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['eve'].values = np.exp(d['eve'].values)\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 4\n",
    "1. As above for `np.exp` applied to the data for Eve, apply a `numpy` function to the data for Alice.\n",
    "2. What happens to the unit and uncertanties when modifying data with external code such as `numpy`?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['alice'].values = np.sin(d['alice'].values)\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Numpy operations are not aware of the unit and uncertainties. Therefore the result is \"garbage\", unless the user has ensured herself that units and uncertainties are handled manually.\n",
    "\n",
    "Corollary: Whenever available, built-in operators and functions should be preferred over the use of `numpy`: these will handle units and uncertanties for you."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 5\n",
    "1. Try adding a scalar value such as `1.5` to the `values` for `'eve'` or and `'alice'`.\n",
    "2. Try the same using the `data` property, instead of the `values` property.\n",
    "   Why is it not working for `'alice'`?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solution 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['eve'].values += 1.5\n",
    "d['alice'].values += 1.5\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of `values` we can use the `data` property.\n",
    "This will also correctly deal with variances, if applicable, whereas the direction operation with `values` is unaware of the presence of variances:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['eve'].data += 1.5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `data` for Alice has a unit, so a direct addition with a dimensionless quantity fails:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    d['alice'].data += 1.5\n",
    "except RuntimeError as e:\n",
    "    print(str(e))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use `Variable` to provide a scalar quantity with attached unit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scale = sc.scalar(1.5, unit=sc.units.m*sc.units.m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a short-hand for creating a scalar variable, just multiply a value by a unit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scale = 1.5 * (sc.units.m*sc.units.m)\n",
    "d['alice'].data += scale\n",
    "\n",
    "sc.table(d)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Continue to [Part 2 - Multi-dimensional datasets](multi-d-datasets.ipynb) to see how datasets are used with multi-dimensional data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}