Audience: Advanced beginner / intermediate (requires basic knowledge of scipp.DataArray)

Objectives: Constructing binned data from an event list, modifying binning in order to extract different quantities, and basic masking and filtering of event and binned data.

RHESSI Solar Flares#

This tutorial covers the basics of binned data in Scipp by analyzing the list of solar flares recorded by NASA’s RHESSI small explorer [Lin et al].

The input data has been constructed from the official flare list. It is available as a HDF5 file in Scipp’s own format and can be downloaded and accessed directly via scipp.data as shown below.

Attention

The tutorial data has been filtered and modified. It should not be used for any actual scientific analyses!

See docs/tutorials/prepare_data_rhessi.py in the scipp source for details.

1 Loading and preprocessing data#

[1]:
import scipp as sc

1.1 Load Flare List#

The data (flares.data) contains a weight for each flare. Initially, all weights are 1.

The most important metadata items are:

  • start_time, end_time: Time interval of flare.

  • peak_time: Date and time of the highest x-ray flux.

  • x, y: Position in the image seen by RHESSI.

  • min_energy, max_energy: Energy band that a flare was observed in. Bands do not overlap.

  • non_solar: The event was flagged as not coming from the Sun.

[2]:
filename = sc.data.rhessi_flares()
flares = sc.io.load_hdf5(filename)
flares
Downloading file 'rhessi_flares.h5' from 'https://public.esss.dk/groups/scipp/scipp/1/rhessi_flares.h5' to '/home/runner/.cache/scipp/1'.
[2]:
Show/Hide data repr Show/Hide attributes
scipp.DataArray (4.36 MB)
    • flare: 70253
    • end_time
      (flare)
      datetime64
      s
      2002-02-12T20:48:56, 2002-02-13T06:07:48, ..., 2018-02-09T15:43:32, 2018-02-09T17:17:40
      Values:
      array(['2002-02-12T20:48:56', '2002-02-13T06:07:48', '2002-02-13T09:04:44', ..., '2018-02-09T13:58:32', '2018-02-09T15:43:32', '2018-02-09T17:17:40'], dtype='datetime64[s]')
    • peak_time
      (flare)
      datetime64
      s
      2002-02-12T20:45:06, 2002-02-13T06:05:14, ..., 2018-02-09T15:42:54, 2018-02-09T17:17:26
      Values:
      array(['2002-02-12T20:45:06', '2002-02-13T06:05:14', '2002-02-13T09:04:42', ..., '2018-02-09T13:55:22', '2018-02-09T15:42:54', '2018-02-09T17:17:26'], dtype='datetime64[s]')
    • start_time
      (flare)
      datetime64
      s
      2002-02-12T20:44:08, 2002-02-13T06:03:52, ..., 2018-02-09T15:41:28, 2018-02-09T17:15:56
      Values:
      array(['2002-02-12T20:44:08', '2002-02-13T06:03:52', '2002-02-13T09:02:56', ..., '2018-02-09T13:53:32', '2018-02-09T15:41:28', '2018-02-09T17:15:56'], dtype='datetime64[s]')
    • x
      (flare)
      float64
      4.84813681109535984e-06rad
      604.0, -272.0, ..., -345.0, -268.0
      Values:
      array([ 604., -272., -235., ..., -345., -345., -268.])
    • y
      (flare)
      float64
      4.84813681109535984e-06rad
      -341.0, 390.0, ..., -38.0, -38.0
      Values:
      array([-341., 390., 390., ..., -38., -38., -38.])
    • (flare)
      float64
      𝟙
      1.0, 1.0, ..., 1.0, 1.0
      Values:
      array([1., 1., 1., ..., 1., 1., 1.])
    • citation
      ()
      string
      https://doi.org/10.1023/A:1022428818870
      Values:
      'https://doi.org/10.1023/A:1022428818870'
    • description
      ()
      string
      X-ray flares recorded by NASA's Reuven Ramaty High Energy Solar Spectroscopic Imager (RHESSI) Small Explorer
      Values:
      "X-ray flares recorded by NASA's Reuven Ramaty High Energy Solar Spectroscopic Imager (RHESSI) Small Explorer"
    • detector_efficiency
      ()
      DataArray
      {dims=[x: 3, y: 3], coords=[x, y]}
      Values:
      <scipp.DataArray> Dimensions: Sizes[x:3, y:3, ] Coordinates: x float64 [4.84813681109535984e-06rad] (x [bin-edge]) [-1000, -300, 300, 1000] y float64 [4.84813681109535984e-06rad] (y [bin-edge]) [-600, -200, 200, 600] Data: float64 [dimensionless] (x, y) [0.9, 0.95, ..., 0.94, 0.91]
    • max_energy
      (flare)
      float64
      keV
      12.0, 50.0, ..., 12.0, 12.0
      Values:
      array([12., 50., 12., ..., 12., 12., 12.])
    • min_energy
      (flare)
      float64
      keV
      6.0, 25.0, ..., 6.0, 6.0
      Values:
      array([ 6., 25., 6., ..., 6., 6., 6.])
    • non_solar
      (flare)
      bool
      False, False, ..., False, False
      Values:
      array([False, False, False, ..., False, False, False])
    • url
      ()
      string
      https://hesperia.gsfc.nasa.gov/rhessi3/data-access/rhessi-data/flare-list/index.html
      Values:
      'https://hesperia.gsfc.nasa.gov/rhessi3/data-access/rhessi-data/flare-list/index.html'

1.2 Inspect the Loaded Data#

Begin by inspecting the data array to gain a basic understanding of the data. This task is open-ended, and you can continue when you feel confident that you know what flares contains.

Possible actions:

  • Use scipp.show and scipp.table in addition to the HTML output of the cell above.

  • Extract individual coordinates and attributes using flares.coords['<name>'] and flares.attrs['<name>'].

For guidance, you can answer the following questions. Or find your own.

  • How many flares are there in the dataset?

  • How many flares are flagged as ‘non_solar’?

  • What is the time range of the data?

Tip: Use scipp.sum, scipp.max, and scipp.min with the coordinates and attributes of flares.

[3]:
# YOUR CODE

Solution

Number of flares:

flares.sizes['flare']

Number flagged as non-solar:

flares.attrs['non_solar'].sum()

Time range:

flares.coords['start_time'].min(), flares.coords['end_time'].max()

1.3 Compute Duration#

Calculate the duration of flares as end_time - start_time and store the result as a new coordinate in flares. Remember that flares.coords functions like a Python dict.

[4]:
# YOUR CODE

Solution

Simple:

duration = flares.coords['end_time'] - flares.coords['start_time']
flares.coords['duration'] = duration

Advanced, using a coordinate transformation:

def compute_duration(start_time, end_time):
    return end_time - start_time


flares = flares.transform_coords('duration', {'duration': compute_duration})

What is the combined duration of flares? (Find an appropriate function in https://scipp.github.io/reference/free-functions.html#reduction)

[5]:
# YOUR CODE

Solution

sc.sum(duration).to(unit='D', dtype='float64')

1.4 Create Masks#

Some events in the input data did not originate from the Sun.

There are two options for handling those events, removing them or masking them. You can choose a solution, but the descriptions guide you through masking, which is a method for removing events non-destructively. That is, the masks can be removed later to get the full event list back in order to determine the impact of the masks.

First, store 'non_solar' as a mask and remove it from the attributes. Use flares.attrs and flares.masks which are dict-like objects (similar to flares.coords).

[6]:
# YOUR CODE

Solution

flares.masks['non_solar'] = flares.attrs.pop('non_solar')

Second, there are some flares whose positions could not be determined. Those are stored with x == y == 0 and need to be removed, as well.

Construct a boolean variable by comparing flares.coords['<x_or_y>'] to 0. Note that x and y have unit ‘asec’. This means that you have to construct a ‘0’ with the same unit which can be done using 0 * sc.Unit('asec').

Finally, combine the variables for x and y using mask_x & mask_y and store the result as a new mask in flares.

[7]:
# YOUR CODE

Solution

flares.masks['unknown_position'] = (
    (flares.coords['x'] == 0 * sc.Unit('asec')) &
    (flares.coords['y'] == 0 * sc.Unit('asec')))

How many flares are now masked? (By each mask individually and by the combination.)

Tip: You can sum booleans.

[8]:
# YOUR CODE

Solution

ns_mask = flares.masks['non_solar']
pos_mask = flares.masks['unknown_position']
{
    'non_solar': ns_mask.sum().value,
    'unknown_position': pos_mask.sum().value,
    'combined': sc.sum(ns_mask | pos_mask).value,
}

2 Spatial Distribution#

2.1 Bin by x and y#

Plot the spatial distribution of flares.

Plotting the event list flares would yield a scatter plot which is not particularly useful. A better approach is computing and plotting the density as a function of x and y. This is commonly done by histogramming the events. But Scipp offers an alternative: binning.

Scipp’s ‘binned data’ is similar to a histogram, except that the individual events are preserved. They are simply collected into bins as defined by bin-edge coordinates.

Define bin-edges for x and y, use scipp.bin to create binned data from flares, and plot the result.

Tip:

  • Use scipp.arange or scipp.linspace to construct the edges. Make sure to use the correct unit!

  • Use sc.bin(flares, edges=[<edge_x>, <edge_y>]).

  • Turn your binned data into a histogram before plotting using <binned>.bins.sum().

  • See Plotting 2D data for the relevant parts of the plotting API.

[9]:
# YOUR CODE

Solution

Bin spatially.

spatial = flares.bin(y=sc.linspace('y', -1200, 1200, 100, unit='asec'),
                     x=sc.linspace('x', -1200, 1200, 100, unit='asec'))

Histogram and plot. The arguments to plot are optional but improve the result here.

spatial.hist().plot(aspect='equal',
                    norm='log',
                    labels={
                        'x': 'x [asec]',
                        'y': 'y [asec]'
                    },
                    cmap='inferno')

2.2 Remove Outliers#

The plot shows a lot of outliers. They are not the non-solar events from before because those are not visible in the plot as they were masked out. Instead, the outliers are caused by RHESSI’s electronics or analysis software glitching out and assigning bad positions to the flares. This includes the circular shape even though it looks deceptively Sun-like.

The flare list does not include enough information to exclude all such bad positions. But the instrument can only detect x-rays for \(x \in [-1000~\text{asec}, 1000~\text{asec}]\) and \(y \in [-600~\text{asec}, 600~\text{asec}]\). So to first order, all events outside that range should be removed.

Instead of using a mask as before, use binning this time. Create new bin-edges for x and y with the proper limits and bin the data with them, this will remove all events outside the valid range. Plot the result. (You can either bin the original flares data array or apply new bins to the previously binned array; both via scipp.bin).

[10]:
# YOUR CODE

Solution

Bin into a more narrow range. Bin sizes are chosen such that bins are square.

spatial = flares.bin(y=sc.linspace('y', -600, 600, 90, unit='asec'),
                     x=sc.linspace('x', -1000, 1000, 150, unit='asec'))

And plot like before.

spatial.hist().plot(aspect='equal',
                    norm='log',
                    labels={
                        'x': 'x [asec]',
                        'y': 'y [asec]'
                    },
                    cmap='inferno')

2.3 Correct for Detector Efficiency#

In this tutorial, we assume that the instrument consists of a 3x3 grid of detectors which each record x-rays from distinct directions. (The reality is more complicated, of course. See the wiki for more information.) Furthermore, the tutorial data has been manipulated to simulate different efficiencies of the individual detectors.

The efficiency is available as an attribute:

[11]:
efficiency = flares.attrs['detector_efficiency'].value
efficiency
[11]:
Show/Hide data repr Show/Hide attributes
scipp.DataArray (1.38 KB)
    • x: 3
    • y: 3
    • x
      (x [bin-edge])
      float64
      4.84813681109535984e-06rad
      -1000.000, -300.0, 300.0, 1000.000
      Values:
      array([-1000., -300., 300., 1000.])
    • y
      (y [bin-edge])
      float64
      4.84813681109535984e-06rad
      -600.0, -200.0, 200.0, 600.0
      Values:
      array([-600., -200., 200., 600.])
    • (x, y)
      float64
      𝟙
      0.9, 0.95, ..., 0.94, 0.91
      Values:
      array([[0.9 , 0.95, 0.77], [0.98, 0.89, 0.82], [0.93, 0.94, 0.91]])

Normalize the data by dividing by efficiency. You first need to bin into the correct bins as defined by the coordinates of efficiency.

[12]:
# YOUR CODE

Solution

coarse_spatial = spatial.bin(x=efficiency.coords['x'],
                             y=efficiency.coords['y'])
corrected = coarse_spatial / efficiency

You can plot the corrected data as before. But it makes sense to return to smaller bins in order to resolve the distribution properly. This can be done using scipp.bin with finer edges because the data was only binned and not histogrammed.

[13]:
# YOUR CODE

Solution

Bin in the same way as before but using corrected as input

spatial = corrected.bin(y=sc.linspace('y', -600, 600, 90, unit='asec'),
                        x=sc.linspace('x', -1000, 1000, 150, unit='asec'))

And plot like before.

spatial.hist().plot(aspect='equal',
                    norm='log',
                    labels={
                        'x': 'x [asec]',
                        'y': 'y [asec]'
                    },
                    cmap='inferno')

3 Time Series#

Now, we want to look at the distribution of flares over time. The temporal distribution can be obtained like the spatial one using binning.

Select one of the time coordinates (e.g. ‘peak_time’) and bin with an appropriate bin size. Plot the result.

Important: Use the pre-binned data from the previous task in order to include the detector normalization.

Tip:

  • The time coordinates are event-coordinates, that is, they are not defined at the top level of the binned data (i.e. per bin) but inside of the bins (i.e. per event). You can access them using <binned>.bins.coords['peak_time'].

  • You can add binning by time using <binned>.bin(peak_time=<n_bins>) and produce a three-dimensional array. But here, we are more interested in a one-dimensional distribution. So we need to erase the binning in x and y which can be achieved, e.g., using <binned>.bins.concat('x').

    Alternatively, the lower level function scipp.binning.make_binned can be used to bin in time and erase spatial binning at the same time.

[14]:
# YOUR CODE

Solution

Bin in time and keep the existing spatial binning. transpose is used to put the time dimension on the slider in the plot.

temporal_and_spatial = spatial.bin(peak_time=200)
to_plot = temporal_and_spatial.hist()
to_plot.transpose(['peak_time', 'y', 'x']).plot(vmin=0.0 * sc.units.one,
                                                vmax=0.66 * sc.units.one,
                                                labels={
                                                    'x': 'x [asec]',
                                                    'y': 'y [asec]'
                                                },
                                                cmap='inferno')

Or remove the existing binning, yielding a 1D-variable.

temporal = spatial.bins.concat('x').bins.concat('y').bin(peak_time=200)
temporal.hist().plot(ylabel='number of flares')

Note that it is important to first remove the binning in x and y by using concat. Binning in time first and then removing other binning would use too much memory. It is possible to do everything in one step using a lower-level API:

from scipp.binning import make_binned

time = spatial.bins.coords['peak_time']
min_time = time.min().value
max_time = time.max().value
step = (max_time - min_time) / 200
time_edges = sc.arange('peak_time',
                       min_time,
                       max_time,
                       step,
                       unit=time.bins.unit)
temporal = make_binned(spatial, edges=[time_edges], erase=('x', 'y'))
temporal.hist().plot(ylabel='number of flares')

3.1 Flare Durations#

Another interesting quantity to look at is the duration of flares. The duration was computed earlier and should already be stored as an event coordinate.

Plot the duration as a function of time.

Tip:

  • Construct a new data array from the previous result: duration = <binned_by_time>.copy(). And assign new data using duration.bins.data = <duration_data>.

  • Previously, we used .hist() (which is an alias for .bins.sum()) to make histograms because the data was given as counts. Now the data is seconds which should be averaged instead of summed. So use .bins.mean() instead.

[15]:
# YOUR CODE

Solution

Create a new data array containing the duration as its data.

duration = temporal.copy()
duration.name = 'duration'
duration.bins.data = duration.bins.coords.pop('duration')

Plot duration vs time.

duration.bins.mean().plot()

4 Energy Bands#

The flares were recorded in several non-overlapping energy bands. Those are identified by the ‘min_energy’ and ‘max_energy’ attributes in flares. Since the bands do not overlap, it is sufficient to label them with ‘min_energy’ for simplicity.

In this section, we want to split the temporal distribution from above into the different energy bands. Group the temporal distribution by ‘min_energy’ to obtain two-dimensional data. Use scipp.group instead of scipp.bin this time because every event has exactly one of a set of possible energies instead of a value in a range.

[16]:
# YOUR CODE

Solution

Skipping the lowest energy band here because those events are not confirmed to be solar flares. This is optional.

grouped_by_energy = temporal.group('min_energy')['min_energy', 1:]

Plot the result. A 2D plot is not very useful here, so split the data by ‘min_energy’ and either plot each energy in a separate plot or combine them into a dictionary and plot that:

lines = {'<name>': <data_array>}
sc.plot(lines)
[17]:
# YOUR CODE

Solution

lines = {
    f"min_energy={grouped_by_energy['min_energy', i].attrs['min_energy'].value}":
    grouped_by_energy['min_energy', i].bins.sum()
    for i in range(grouped_by_energy.sizes['min_energy'])
}
sc.plot(lines, ylabel='number of flares')

References#

Lin, R., Dennis, B., Hurford, G. et al. The Reuven Ramaty High-Energy Solar Spectroscopic Imager (RHESSI). Sol Phys 210, 3–32 (2002). doi:10.1023/A:1022428818870