{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GroupBy\n", "\n", "\"Group by\" refers to an implementation of the \"split-apply-combine\" approach known from [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) and [xarray](http://xarray.pydata.org/en/stable/groupby.html).\n", "Scipp currently supports only a limited number of operations that can be applied.\n", "\n", "## Grouping based on label values\n", "\n", "Suppose we have measured data for a number of parameter values, potentially repeating measurements with the same parameter multiple times:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "import numpy as np\n", "import scipp as sc\n", "\n", "np.random.seed(0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param = sc.Variable(dims=['x'], values=[1,3,1,1,5,3])\n", "values = sc.Variable(dims=['x', 'y'], values=np.random.rand(6,16))\n", "values += 1.0 + param" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we store this data as a data array we obtain the following plot:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = sc.DataArray(\n", " values,\n", " coords={\n", " 'x': sc.Variable(dims=['x'], values=np.arange(6)),\n", " 'y': sc.Variable(dims=['y'], values=np.arange(16))\n", " })\n", "sc.plot(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we chose the \"measured\" values such that the three distinct values of the underlying parameter are visible.\n", "We can now use the split-apply-combine mechanism to transform our data into a more useful representation.\n", "We start by storing the parameter values (or any value to be used for grouping) as a non-dimension coordinate:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.coords['param'] = param" ] 
}, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we call `scipp.groupby` to split the data and call `mean` on each of the groups:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped = sc.groupby(data, group='param').mean('x')\n", "sc.plot(grouped)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apart from `mean`, `groupby` also supports `sum`, `concatenate`, and more. See [GroupByDataArray](../generated/classes/scipp.GroupByDataArray.rst) and [GroupByDataset](../generated/classes/scipp.GroupByDataset.rst) for a full list.\n", "\n", "## Grouping based on binned label values\n", "\n", "Grouping based on non-dimension coordinate values (also known as labels) is most useful when labels are strings or integers.\n", "If labels are floating-point values or cover a wide range, it is more convenient to group values into bins, i.e., all values within certain bounds are mapped into the same group.\n", "We modify the above example to use a continuously-valued parameter:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param = sc.Variable(dims=['x'], values=np.random.rand(16))\n", "values = sc.Variable(dims=['x', 'y'], values=np.random.rand(16,16))\n", "values += 1.0 + 5.0*param" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = sc.DataArray(\n", " values,\n", " coords={\n", " 'x': sc.Variable(dims=['x'], values=np.arange(16)),\n", " 'y': sc.Variable(dims=['y'], values=np.arange(16))\n", " })\n", "sc.plot(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a variable defining the desired binning:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bins = sc.Variable(dims=[\"z\"], values=np.linspace(0.0, 1.0, 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, we can now use `groupby` and `mean` to 
transform the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.coords['param'] = param\n", "grouped = sc.groupby(data, group='param', bins=bins).mean('x')\n", "sc.plot(grouped)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The values in the white rows are `NaN`.\n", "This is the result of empty bins, which do not have a meaningful mean value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, the group can be passed as a Variable rather than as the name of a coordinate. This, however, requires `bins` to be specified, since the bins define the label of the new dimension." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "grouped = sc.groupby(data, group=param, bins=bins).mean('x') # note the lack of quotes around param!\n", "sc.plot(grouped)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Usage examples\n", "\n", "### Filtering a variable using `groupby.copy`\n", "\n", "Apart from the reduction operations discussed above, `groupby` also supports `copy`, which allows us to extract a group without changes.\n", "We can use this, e.g., to filter a variable based on a boolean condition:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var = sc.array(dims=['x'], values=np.random.rand(100))\n", "select = var < 0.5 * sc.Unit('')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We proceed as follows:\n", "\n", "1. Create a helper data array with a dummy coord that will be used to group the data elements.\n", "2. Call `groupby`, grouping by the `dummy` coord. Here `select` contains two distinct values, `False` and `True`, so `groupby` returns an object with two groups.\n", "3. Pass `1` to `copy` to extract the second group (group indices start at 0), which contains all elements where the dummy coord value is `True`.\n", "4. 
Finally, the `data` property returns only the filtered variable without the temporary coords that were required for `groupby`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "helper = sc.DataArray(var, coords={'dummy':select})\n", "grouped = sc.groupby(helper, group='dummy')\n", "filtered_var = grouped.copy(1).data\n", "filtered_var" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we can also avoid the named helpers `helper` and `grouped` and write:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filtered_var = sc.groupby(sc.DataArray(var, coords={'dummy':select}), group='dummy').copy(1).data" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.10" } }, "nbformat": 4, "nbformat_minor": 4 }