# Rearranging and Filtering Binned Data

Event filtering refers to the process of removing or extracting a subset of events based on some criterion such as the temperature of the measured sample at the time an event was detected.
Instead of extracting based on a single parameter value or interval, we may also want to rearrange data based on the parameter value, providing quick and convenient access to the parameter-dependence of our data.
Scipp's binned data can be used for both of these purposes.

The [Quick Reference](#Quick-Reference) below provides a brief overview of the options.
A more detailed walkthrough based on actual data can be found in the [Full example](#Full-example).

## Quick Reference

### Extract events matching parameter value

Use [label-based indexing on the bins property](../../generated/classes/scipp.Bins.rst#scipp.Bins).
This works similarly to regular [label-based indexing](../slicing.ipynb#Label-based-indexing) but operates on the unordered bin contents.
Example:

```python
param_value = sc.scalar(1.2, unit='m')
filtered = da.bins['param', param_value]
```

- The output data array has the same dimensions as the input `da`.
- `filtered` contains a *copy* of the filtered events.

### Extract events falling into a parameter interval

Use [label-based indexing on the bins property](../../generated/classes/scipp.Bins.rst#scipp.Bins).
This works similarly to regular [label-based indexing](../slicing.ipynb#Label-based-indexing) but operates on the unordered bin contents.
Example:

```python
start = sc.scalar(1.2, unit='m')
stop = sc.scalar(1.3, unit='m')
filtered = da.bins['param', start:stop]
```

- The output data array has the same dimensions as the input `da`.
- `filtered` contains a *copy* of the filtered events.
- Note that as usual the upper bound of the interval (here $1.3~\text{m}$) is *not* included.
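
The half-open convention can be illustrated without Scipp. The following plain-Python sketch shows only the selection rule, not how Scipp implements it; the event values and the helper name `select_interval` are made up for illustration:

```python
def select_interval(values, start, stop):
    """Keep values in the half-open interval [start, stop)."""
    return [v for v in values if start <= v < stop]


# Hypothetical per-event parameter values (in metres).
events = [1.15, 1.2, 1.25, 1.3, 1.35]
selected = select_interval(events, start=1.2, stop=1.3)
print(selected)  # [1.2, 1.25]
```

An event exactly at the lower bound is kept, while an event exactly at the upper bound is dropped.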

### Split into bins based on a discrete event parameter

Use [scipp.group](../../generated/functions/scipp.group.rst).
Example:

```python
split = da.group('param')
```

- The output data array has a new dimension `'param'` in addition to the dimensions of the input.
- `split` contains a *copy* of the reordered events.
- Pass an explicit variable to `group` listing desired groups to limit what is included in the output.

### Split into bins based on a continuous event parameter

Use [scipp.bin](../../generated/functions/scipp.bin.rst).
Example:

```python
split = da.bin(param=10)
```

- The output data array has a new dimension `'param'` in addition to the dimensions of the input.
- `split` contains a *copy* of the reordered events.
- Provide an explicit variable to `bin` to limit the parameter interval that is included in the output, or for fine-grained control over the sub-intervals.
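
The difference between grouping and binning can be sketched in plain Python. This is a conceptual illustration under simplifying assumptions (events as dictionaries, half-open intervals, events outside the edges dropped), not Scipp's implementation; the helper names `group_events` and `bin_events` and the sample events are hypothetical:

```python
from bisect import bisect_right
from collections import defaultdict


def group_events(events, key):
    """One output bin per distinct value of a discrete parameter."""
    groups = defaultdict(list)
    for event in events:
        groups[event[key]].append(event)
    return dict(groups)


def bin_events(events, key, edges):
    """One output bin per interval [edges[i], edges[i+1]) of a continuous parameter."""
    bins = [[] for _ in range(len(edges) - 1)]
    for event in events:
        i = bisect_right(edges, event[key]) - 1
        if 0 <= i < len(bins):  # events outside the edges are dropped
            bins[i].append(event)
    return bins


events = [
    {"run": 1, "temperature": 290.5},
    {"run": 2, "temperature": 301.0},
    {"run": 1, "temperature": 310.2},
    {"run": 2, "temperature": 295.0},
]
by_run = group_events(events, "run")
by_temperature = bin_events(events, "temperature", edges=[290.0, 300.0, 310.0])
print({run: len(evts) for run, evts in by_run.items()})  # {1: 2, 2: 2}
print([len(b) for b in by_temperature])  # [2, 1]
```

Both helpers visit every event once and produce all output bins in that single pass, which is why splitting is preferable to repeated interval extraction when many values or intervals are needed.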

### Compute derived event parameters for subsequent extracting or splitting

Use [scipp.transform_coords](../../generated/functions/scipp.transform_coords.rst).
Example:

```python
da2 = da.transform_coords(derived_param=lambda p1, p2: p1 + p2)
```

`da2` can now be used with any of the methods for extracting or splitting data described above.
The intermediate variable can also be omitted, and we can directly extract or split the result:

```python
filtered = da.transform_coords(derived_param=lambda p1, p2: p1 + p2) \
             .bin(derived_param=10)
```

### Compute derived event parameters from time-series or other metadata

In practice, events are often tagged with a timestamp, which can be used to look up parameter values from, e.g., a time-series log given by a data array with a single dimension and a coordinate matching the coordinate name of the timestamps.
Use [scipp.lookup](../../generated/functions/scipp.lookup.rst) with [scipp.transform_coords](../../generated/functions/scipp.transform_coords.rst). Example:

```python
temperature = da.attrs['sample_temperature'].value  # temperature value time-series
interp_temperature = sc.lookup(temperature, mode='previous')
filtered = da.transform_coords(temperature=interp_temperature) \
             .bin(temperature=10)
```

## Full example

### Input data

In the following we use neutron diffraction data for a stainless steel tensile bar in a loadframe measured at the [VULCAN Engineering Materials Diffractometer](https://neutrons.ornl.gov/vulcan), kindly provided by the SNS.
Scipp's sample data includes an excerpt from the full dataset:

In [None]:
import scipp as sc

da = sc.data.vulcan_steel_strain_data()
da

The `dspacing` dimension is the interplanar lattice spacing (the spacing between planes in a crystal), and plotting this data we see two diffraction peaks:

In [None]:
da.hist().plot()

### Extract time interval

The [mechanical strain](https://en.wikipedia.org/wiki/Strain_(mechanics)) of the steel sample in the loadframe is recorded in the metadata:

In [None]:
strain = da.attrs['loadframe.strain'].value
strain.plot()

We see that the strain drops off for some reason at the end.
We can filter out those events by extracting the rest as outlined in [Extract events falling into a parameter interval](#Extract-events-falling-into-a-parameter-interval):

In [None]:
import numpy as np

start = strain.coords['time'][0]
stop = strain.coords['time'][np.argmax(strain.values)]
da = da.bins['time', start:stop]

<div class="alert alert-info">

**Note**
    
The above is just a concise way of binning into a single time interval and squeezing the time dimension from the result.

If *multiple* intervals are to be extracted then the mechanism based on `start` and `stop` values becomes highly inefficient, as every time `da.bins['param', start:stop]` is called *all* of the events have to be processed.
Instead, prefer using `da.bin(param=param_bin_edges)` and slice the result using regular positional (or label-based) indexing.
Similarly, prefer using `da.group('param')` to extract based on multiple discrete values.
    
</div>

### Filter bad pulses

In the previous example we directly used an existing event-coordinate (`da.bins.coords['time']`) for selecting the desired subset of data.
In many practical cases such a coordinate may not be available yet and needs to be computed as a preparatory step.
Scipp facilitates this using [scipp.transform_coords](../../generated/functions/scipp.transform_coords.rst) and [scipp.lookup](../../generated/functions/scipp.lookup.rst).
When the desired event-coordinate can be computed directly from existing coordinates then `transform_coords` can do the job on its own.
In other cases, such as the following example, we combine it with `lookup` to, e.g., map timestamps to corresponding sensor readings.

Our data stores the so-called *proton charge*, the total charge of protons per pulse (which produced the neutrons scattered off the sample):

In [None]:
proton_charge = da.attrs['proton_charge'].value
proton_charge.plot()

Some pulses have a very low proton charge which may indicate a problem with the source, so we may want to remove events that were produced from these pulses.
We can use `lookup` to define the following "interpolation function", marking any pulse as "good" if it has more than 90% of the mean proton charge:

In [None]:
good_pulse = sc.lookup(proton_charge > 0.9 * proton_charge.mean(), mode='previous')

`transform_coords` can utilize this interpolation function to compute a new coordinate (`good_pulse`, with `True` and `False` values) from the `da.bins.coords['time']` coordinate.
We used `mode='previous'` above, so an event's `good_pulse` value will be defined by the *previous* pulse, i.e., the one that produced the neutron event.
See the documentation of [scipp.lookup](../../generated/functions/scipp.lookup.rst) for a full list of available options.
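
To make the idea of a "previous"-style lookup concrete, here is a minimal plain-Python sketch. It is not Scipp's implementation, and the treatment of edge cases (an event exactly at a pulse time, or before the first pulse) is an assumption of this sketch; the helper name `lookup_previous` and the pulse log are hypothetical:

```python
from bisect import bisect_right


def lookup_previous(log_times, log_values, event_time):
    """Return the log value recorded at or before event_time (None if there is none)."""
    i = bisect_right(log_times, event_time) - 1
    return log_values[i] if i >= 0 else None


# Hypothetical pulse log: pulse times and whether each pulse was "good".
pulse_times = [0.0, 1.0, 2.0, 3.0]
pulse_is_good = [True, False, True, True]

event_times = [0.4, 1.0, 1.7, 2.2]
flags = [lookup_previous(pulse_times, pulse_is_good, t) for t in event_times]
print(flags)  # [True, False, False, True]
```

Each event inherits the flag of the most recent pulse, so the events at 1.0 and 1.7 are both attributed to the bad pulse at time 1.0.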

The return value of `transform_coords` can then be used to index the `bins` property, here to extract only the events that have `good_pulse=True`, i.e., were created by a proton pulse that fulfilled the above criterion:

In [None]:
da = da.transform_coords(good_pulse=good_pulse) \
       .bins['good_pulse', sc.index(True)]
da

### Rearrange data based on strain

As outlined in [Split into bins based on a continuous event parameter](#Split-into-bins-based-on-a-continuous-event-parameter) we can rearrange data based on the current strain value at the time a neutron event was recorded:

In [None]:
interpolate_strain = sc.lookup(strain, mode='previous')
filtered = da.transform_coords(strain=interpolate_strain) \
             .bin(strain=100)

We can histogram and plot the result, but the figure is not very illuminating.
This illustrates a common problem, and we will show below how to address it:

In [None]:
filtered.hist().transpose().plot()

The problem we run into with the above figure is that we have a lot more data (events) at zero strain than for the other strain values.
We should therefore *normalize* the result to the incident flux.
In this simplified example we can use the proton charge to accomplish this.
We compute the integrated proton charge for a given strain value:

In [None]:
proton_charge = da.attrs['proton_charge'].value
charge_per_time_interval = proton_charge.bin(time=strain.coords['time'])
charge_per_time_interval.coords['strain'] = strain.data
charge_per_strain_value = charge_per_time_interval.bin(strain=filtered.coords['strain']).hist()

We can then normalize our data and obtain a more meaningful plot.
It is now clearly visible how one of the diffraction peaks splits into two under increasing mechanical strain on the sample:

In [None]:
normalized = (filtered/charge_per_strain_value)
normalized.hist(strain=30, dspacing=300).plot()
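
Why the normalization matters can be seen with a few made-up numbers. In this plain-Python sketch (all values are hypothetical, not taken from the VULCAN data), the first strain bin collected 30 times more proton charge than the others, so its raw counts dominate until they are divided by the charge:

```python
# Hypothetical per-strain-bin totals: event counts and integrated proton charge.
counts = [9000, 300, 310, 290]
charge = [30.0, 1.0, 1.0, 1.0]

# Counts per unit charge are comparable across strain bins.
rate = [n / q for n, q in zip(counts, charge)]
print(rate)  # [300.0, 300.0, 310.0, 290.0]
```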

As we have reordered our data by strain, data for a specific strain value can now be *cheaply* extracted using positional or label-based indexing.
This returns a view of the events, i.e., not a copy:

In [None]:
normalized['strain', 80].hist(dspacing=400).plot()

In practice it can be useful to integrate strain ranges for comparison in a 1-D plot.
Here we simply histogram with a coarser strain binning:

In [None]:
coarse = normalized.hist(strain=6, dspacing=200)
strains = [coarse['strain', sc.scalar(x)] for x in [0.0, 0.3, 0.6]]
lines = {f"strain={strain.attrs['strain'].values}": strain for strain in strains}
sc.plot(lines, norm='log')