{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# GroupBy\n",
    "\n",
    "\"Group by\" refers to an implementation of the \"split-apply-combine\" approach known from [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) and [xarray](http://xarray.pydata.org/en/stable/groupby.html).\n",
    "Scipp currently supports only a limited number of operations that can be applied.\n",
    "\n",
    "## Grouping based on label values\n",
    "\n",
    "Suppose we have measured data for a number of parameter values, potentially repeating measurements with the same parameter multiple times:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import scipp as sc\n",
    "\n",
    "np.random.seed(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "param = sc.Variable(dims=['x'], values=[1,3,1,1,5,3])\n",
    "values = sc.Variable(dims=['x', 'y'], values=np.random.rand(6,16))\n",
    "values += 1.0 + param"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we store this data as a data array we obtain the following plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = sc.DataArray(\n",
    "    values,\n",
    "    coords={\n",
    "        'x': sc.Variable(dims=['x'], values=np.arange(6)),\n",
    "        'y': sc.Variable(dims=['y'], values=np.arange(16))\n",
    "    })\n",
    "sc.plot(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we chose the \"measured\" values such that the three distinct values of the underlying parameter are visible.\n",
    "We can now use the split-apply-combine mechanism to transform our data into a more useful representation.\n",
    "We start by storing the parameter values (or any value to be used for grouping) as a non-dimension coordinate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.coords['param'] = param"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we call `scipp.groupby` to split the data and call `mean` on each of the groups:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped = sc.groupby(data, group='param').mean('x')\n",
    "sc.plot(grouped)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Apart from `mean`, `groupby` also supports `sum`, `concatenate`, and more. See [GroupByDataArray](../generated/classes/scipp.GroupByDataArray.rst) and [GroupByDataset](../generated/classes/scipp.GroupByDataset.rst) for a full list.\n",
    "\n",
    "## Grouping based on binned label values\n",
    "\n",
    "Grouping based on non-dimension coordinate values (also known as labels) is most useful when labels are strings or integers.\n",
    "If labels are floating-point values or cover a wide range, it is more convenient to group values into bins, i.e., all values within certain bounds are mapped into the same group.\n",
    "We modify the above example to use a contiuously-valued parameter:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "param = sc.Variable(dims=['x'], values=np.random.rand(16))\n",
    "values = sc.Variable(dims=['x', 'y'], values=np.random.rand(16,16))\n",
    "values += 1.0 + 5.0*param"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = sc.DataArray(\n",
    "    values,\n",
    "    coords={\n",
    "        'x': sc.Variable(dims=['x'], values=np.arange(16)),\n",
    "        'y': sc.Variable(dims=['y'], values=np.arange(16))\n",
    "    })\n",
    "sc.plot(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We create a variable defining the desired binning:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bins = sc.Variable(dims=[\"z\"], values=np.linspace(0.0, 1.0, 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As before, we can now use `groupby` and `mean` to transform the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.coords['param'] = param\n",
    "grouped = sc.groupby(data, group='param', bins=bins).mean('x')\n",
    "sc.plot(grouped)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The values in the white rows are `NaN`.\n",
    "This is the result of empty bins, which do not have a meaningful mean value."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, grouping can be done based on groups defined as Variables rather than strings. This, however, requires bins to be specified, since bins define the new dimension label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped = sc.groupby(data, group=param, bins=bins).mean('x') # note the lack of quotes around param!\n",
    "sc.plot(grouped)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Usage examples\n",
    "\n",
    "### Filtering a variable using `groupby.copy`\n",
    "\n",
    "Apart from reduction operations discussed above, `groupby` also supports `copy`, which allows us to extract a group without changes.\n",
    "We can use this, e.g., to filter data.\n",
    "This can be used for filtering variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "var = sc.array(dims=['x'], values=np.random.rand(100))\n",
    "select = var < 0.5 * sc.Unit('')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We proceed as follows:\n",
    "\n",
    "1. Create a helper data array with a dummy coord that will be used to group the data elements.\n",
    "2. Call `groupby`, grouping by the `dummy` coord.  Here `select` contains two distinct values, `False` and `True`, so `groupby` returns an object with two groups.\n",
    "2. Pass `1` to `copy` to extract the second group (group indices start at 0) which contains all elements where the dummy coord value is `True`.\n",
    "3. Finally, the `data` property returns only the filtered variable without the temporary coords that were required for `groupby`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "helper = sc.DataArray(var, coords={'dummy':select})\n",
    "grouped = sc.groupby(helper, group='dummy')\n",
    "filtered_var = grouped.copy(1).data\n",
    "filtered_var"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we can also avoid the named helpers `helper` and `grouped` and write:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filtered_var = sc.groupby(sc.DataArray(var, coords={'dummy':select}), group='dummy').copy(1).data"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}