Audience: Advanced beginner / intermediate (requires basic knowledge of scipp.DataArray)

Objectives: Constructing binned data from an event list, modifying binning in order to extract different quantities, and basic masking and filtering of event and binned data.

RHESSI Solar Flares#

This tutorial covers the basics of binned data in Scipp by analyzing the list of solar flares recorded by NASA’s RHESSI small explorer [Lin et al].

The input data has been constructed from the official flare list. It is available as a HDF5 file in Scipp’s own format and can be downloaded and accessed directly via scipp.data as shown below.

Attention

The tutorial data has been filtered and modified. It should not be used for any actual scientific analyses!

See docs/getting-started/tutorials/prepare_data_rhessi.py in the scipp source for details.

1 Loading and preprocessing data#

[1]:
import scipp as sc

1.1 Load Flare List#

The file contains a number of items that are loaded together as a data group. - flares: The actual flare list. - non_solar: Indicates whether a flare event actually originated from a source that is not the Sun. - detector_efficiency: Relative efficiencies of the detector components. This is faked for the purposes of this tutorial! The official flare list is already corrected for such effects.

[2]:
filename = sc.data.rhessi_flares()
flare_datagroup = sc.io.load_hdf5(filename)
flare_datagroup
Downloading file 'rhessi_flares.h5' from 'https://public.esss.dk/groups/scipp/scipp/2/rhessi_flares.h5' to '/home/runner/.cache/scipp/2'.
[2]:
  • flares
    scipp
    DataArray
    (flare: 70253)
    float64
    𝟙
    1.0, 1.0, ..., 1.0, 1.0
  • non_solar
    scipp
    Variable
    (flare: 70253)
    bool
    False, False, ..., False, False
  • detector_efficiency
    scipp
    DataArray
    (x: 3, y: 3)
    float64
    𝟙
    0.9, 0.95, ..., 0.94, 0.91
  • citation
    str
    ()
    https://doi.org/10.1023/A:1022428818870
  • description
    str
    ()
    X-ray flares recorded by NASA's Reuven Ramaty High Energy Solar Spectroscopic Im...
  • url
    str
    ()
    https://hesperia.gsfc.nasa.gov/rhessi3/data-access/rhessi-data/flare-list/index....

For ease of use, we extract the flares data array. Its data (flares.data) contains a weight for each flare. Initially, all weights are 1.

Its coordinates have the following meanings:

  • start_time, end_time: Time interval of flare.

  • peak_time: Date and time of the highest x-ray flux.

  • x, y: Position in the image seen by RHESSI.

  • min_energy, max_energy: Energy band that a flare was observed in. Bands do not overlap.

[3]:
flares = flare_datagroup['flares']
flares
[3]:
Show/Hide data repr Show/Hide attributes
scipp.DataArray (4.29 MB)
    • flare: 70253
    • end_time
      (flare)
      datetime64
      s
      2002-02-12T20:48:56, 2002-02-13T06:07:48, ..., 2018-02-09T15:43:32, 2018-02-09T17:17:40
      Values:
      array(['2002-02-12T20:48:56', '2002-02-13T06:07:48', '2002-02-13T09:04:44', ..., '2018-02-09T13:58:32', '2018-02-09T15:43:32', '2018-02-09T17:17:40'], dtype='datetime64[s]')
    • max_energy
      (flare)
      float64
      keV
      12.0, 50.0, ..., 12.0, 12.0
      Values:
      array([12., 50., 12., ..., 12., 12., 12.])
    • min_energy
      (flare)
      float64
      keV
      6.0, 25.0, ..., 6.0, 6.0
      Values:
      array([ 6., 25., 6., ..., 6., 6., 6.])
    • peak_time
      (flare)
      datetime64
      s
      2002-02-12T20:45:06, 2002-02-13T06:05:14, ..., 2018-02-09T15:42:54, 2018-02-09T17:17:26
      Values:
      array(['2002-02-12T20:45:06', '2002-02-13T06:05:14', '2002-02-13T09:04:42', ..., '2018-02-09T13:55:22', '2018-02-09T15:42:54', '2018-02-09T17:17:26'], dtype='datetime64[s]')
    • start_time
      (flare)
      datetime64
      s
      2002-02-12T20:44:08, 2002-02-13T06:03:52, ..., 2018-02-09T15:41:28, 2018-02-09T17:15:56
      Values:
      array(['2002-02-12T20:44:08', '2002-02-13T06:03:52', '2002-02-13T09:02:56', ..., '2018-02-09T13:53:32', '2018-02-09T15:41:28', '2018-02-09T17:15:56'], dtype='datetime64[s]')
    • x
      (flare)
      float64
      4.84813681109535984e-06rad
      604.0, -272.0, ..., -345.0, -268.0
      Values:
      array([ 604., -272., -235., ..., -345., -345., -268.])
    • y
      (flare)
      float64
      4.84813681109535984e-06rad
      -341.0, 390.0, ..., -38.0, -38.0
      Values:
      array([-341., 390., 390., ..., -38., -38., -38.])
    • (flare)
      float64
      𝟙
      1.0, 1.0, ..., 1.0, 1.0
      Values:
      array([1., 1., 1., ..., 1., 1., 1.])

1.2 Inspect the Loaded Data#

Begin by inspecting the data group and flares data array to gain a basic understanding of the data. This task is open-ended, and you can continue when you feel confident that you know what flares (and flare_datagroup) contains.

Possible actions:

  • Use scipp.show and scipp.table in addition to the HTML output of the cell above.

  • Inspect individual coordinates using flares.coords['<name>'].

For guidance, you can answer the following questions. Or find your own.

  • How many flares are there in the dataset?

  • How many flares are flagged as ‘non_solar’?

  • What is the time range of the data?

Tip: Use scipp.sum, scipp.max, and scipp.min with the coordinates of flares and items of flare_datagroup.

[4]:
# YOUR CODE

Solution#

[5]:
# Number of flares
flares.sizes['flare']
[5]:
70253
[6]:
# Number of flares flagged as non-solar
flare_datagroup['non_solar'].sum()
[6]:
Show/Hide data repr Show/Hide attributes
scipp.Variable (264 Bytes)
    • ()
      int64
      253
      Values:
      array(253)
[7]:
# Start of time range
flares.coords['start_time'].min()
[7]:
Show/Hide data repr Show/Hide attributes
scipp.Variable (264 Bytes)
    • ()
      datetime64
      s
      2002-02-12T20:44:08
      Values:
      array('2002-02-12T20:44:08', dtype='datetime64[s]')
[8]:
# End of time range
flares.coords['end_time'].max()
[8]:
Show/Hide data repr Show/Hide attributes
scipp.Variable (264 Bytes)
    • ()
      datetime64
      s
      2018-02-09T17:17:40
      Values:
      array('2018-02-09T17:17:40', dtype='datetime64[s]')

1.3 Compute Duration#

Calculate the duration of flares as end_time - start_time and store the result as a new coordinate in flares. Remember that flares.coords functions like a Python dict.

[9]:
# YOUR CODE

Solution#

Simple:

[10]:
duration = flares.coords['end_time'] - flares.coords['start_time']
flares.coords['duration'] = duration

Advanced, using a coordinate transformation:

[11]:
def compute_duration(start_time, end_time):
    return end_time - start_time


flares = flares.transform_coords('duration', {'duration': compute_duration})
[12]:
flares
[12]:
Show/Hide data repr Show/Hide attributes
scipp.DataArray (4.83 MB)
    • flare: 70253
    • duration
      (flare)
      int64
      s
      288, 236, ..., 124, 104
      Values:
      array([288, 236, 108, ..., 300, 124, 104])
    • end_time
      (flare)
      datetime64
      s
      2002-02-12T20:48:56, 2002-02-13T06:07:48, ..., 2018-02-09T15:43:32, 2018-02-09T17:17:40
      Values:
      array(['2002-02-12T20:48:56', '2002-02-13T06:07:48', '2002-02-13T09:04:44', ..., '2018-02-09T13:58:32', '2018-02-09T15:43:32', '2018-02-09T17:17:40'], dtype='datetime64[s]')
    • max_energy
      (flare)
      float64
      keV
      12.0, 50.0, ..., 12.0, 12.0
      Values:
      array([12., 50., 12., ..., 12., 12., 12.])
    • min_energy
      (flare)
      float64
      keV
      6.0, 25.0, ..., 6.0, 6.0
      Values:
      array([ 6., 25., 6., ..., 6., 6., 6.])
    • peak_time
      (flare)
      datetime64
      s
      2002-02-12T20:45:06, 2002-02-13T06:05:14, ..., 2018-02-09T15:42:54, 2018-02-09T17:17:26
      Values:
      array(['2002-02-12T20:45:06', '2002-02-13T06:05:14', '2002-02-13T09:04:42', ..., '2018-02-09T13:55:22', '2018-02-09T15:42:54', '2018-02-09T17:17:26'], dtype='datetime64[s]')
    • start_time
      (flare)
      datetime64
      s
      2002-02-12T20:44:08, 2002-02-13T06:03:52, ..., 2018-02-09T15:41:28, 2018-02-09T17:15:56
      Values:
      array(['2002-02-12T20:44:08', '2002-02-13T06:03:52', '2002-02-13T09:02:56', ..., '2018-02-09T13:53:32', '2018-02-09T15:41:28', '2018-02-09T17:15:56'], dtype='datetime64[s]')
    • x
      (flare)
      float64
      4.84813681109535984e-06rad
      604.0, -272.0, ..., -345.0, -268.0
      Values:
      array([ 604., -272., -235., ..., -345., -345., -268.])
    • y
      (flare)
      float64
      4.84813681109535984e-06rad
      -341.0, 390.0, ..., -38.0, -38.0
      Values:
      array([-341., 390., 390., ..., -38., -38., -38.])
    • (flare)
      float64
      𝟙
      1.0, 1.0, ..., 1.0, 1.0
      Values:
      array([1., 1., 1., ..., 1., 1., 1.])

What is the combined duration of flares? (Find an appropriate function in https://scipp.github.io/reference/free-functions.html#reduction)

[13]:
# YOUR CODE

Solution#

[14]:
sc.sum(duration).to(unit='D', dtype='float64')
[14]:
Show/Hide data repr Show/Hide attributes
scipp.Variable (264 Bytes)
    • ()
      float64
      D
      375.4030555555555
      Values:
      array(375.40305556)

1.4 Create Masks#

Some events in the input data did not originate from the Sun.

There are two options for handling those events, removing them or masking them. You can choose a solution, but the descriptions guide you through masking, which is a method for removing events non-destructively. That is, the masks can be removed later to get the full event list back in order to determine the impact of the masks.

Non-solar mask#

First, store 'non_solar' as a mask. Use flare_datagroup and flares.masks which are dict-like objects (similar to flares.coords).

[15]:
# YOUR CODE

Solution#

[16]:
flares.masks['non_solar'] = flare_datagroup['non_solar']

Unknown position mask#

Second, there are some flares whose positions could not be determined. Those are stored with x == y == 0 and need to be removed, as well.

Construct a boolean variable by comparing flares.coords['<x_or_y>'] to 0. Note that x and y have unit ‘asec’. This means that you have to construct a ‘0’ with the same unit which can be done using 0 * sc.Unit('asec').

Finally, combine the variables for x and y using mask_x & mask_y and store the result as a new mask in flares.

[17]:
# YOUR CODE

Solution#

[18]:
flares.masks['unknown_position'] = (flares.coords['x'] == 0 * sc.Unit('asec')) & (
    flares.coords['y'] == 0 * sc.Unit('asec')
)

How many flares are now masked? (By each mask individually and by the combination.)

Tip: You can sum booleans.

[19]:
# YOUR CODE

Solution#

[20]:
ns_mask = flares.masks['non_solar']
pos_mask = flares.masks['unknown_position']
{
    'non_solar': ns_mask.sum().value,
    'unknown_position': pos_mask.sum().value,
    'combined': sc.sum(ns_mask | pos_mask).value,
}
[20]:
{'non_solar': np.int64(253),
 'unknown_position': np.int64(703),
 'combined': np.int64(703)}

2 Spatial Distribution#

2.1 Bin by x and y#

Plot the spatial distribution of flares.

Plotting the event list flares would yield a scatter plot which is not particularly useful. A better approach is computing and plotting the density as a function of x and y. This is commonly done by histogramming the events. But Scipp offers an alternative: binning.

Scipp’s ‘binned data’ is similar to a histogram, except that the individual events are preserved. They are simply collected into bins as defined by bin-edge coordinates.

Define bin-edges for x and y, use scipp.bin to create binned data from flares, and plot the result.

Tip:

  • Use scipp.arange or scipp.linspace to construct the edges. Make sure to use the correct unit!

  • Use sc.bin(flares, edges=[<edge_x>, <edge_y>]).

  • Turn your binned data into a histogram before plotting using <binned>.bins.sum().

  • See the Plopp documentation for the relevant parts of the plotting API.

[21]:
# YOUR CODE

Solution#

Bin spatially.

[22]:
spatial = flares.bin(
    y=sc.linspace('y', -1200, 1200, 100, unit='asec'),
    x=sc.linspace('x', -1200, 1200, 100, unit='asec'),
)

Histogram and plot. The arguments to plot are optional but improve the result here.

[23]:
p = spatial.hist().plot(aspect='equal', norm='log', cmap='inferno')
p.canvas.xlabel = 'x [asec]'
p.canvas.ylabel = 'y [asec]'
p
[23]:
../../_images/getting-started_tutorials_solar_flares_58_0.svg

2.2 Remove Outliers#

The plot shows a lot of outliers. They are not the non-solar events from before because those are not visible in the plot as they were masked out. Instead, the outliers are caused by RHESSI’s electronics or analysis software glitching out and assigning bad positions to the flares. This includes the circular shape even though it looks deceptively Sun-like.

The flare list does not include enough information to exclude all such bad positions. But the instrument can only detect x-rays for \(x \in [-1000~\text{asec}, 1000~\text{asec}]\) and \(y \in [-600~\text{asec}, 600~\text{asec}]\). So to first order, all events outside that range should be removed.

Instead of using a mask as before, use binning this time. Create new bin-edges for x and y with the proper limits and bin the data with them, this will remove all events outside the valid range. Plot the result. (You can either bin the original flares data array or apply new bins to the previously binned array; both via scipp.bin).

[24]:
# YOUR CODE

Solution#

Bin into a more narrow range. Bin sizes are chosen such that bins are square.

[25]:
spatial = flares.bin(
    y=sc.linspace('y', -600, 600, 90, unit='asec'),
    x=sc.linspace('x', -1000, 1000, 150, unit='asec'),
)

And plot like before.

[26]:
p = spatial.hist().plot(aspect='equal', norm='log', cmap='inferno')
p.canvas.xlabel = 'x [asec]'
p.canvas.ylabel = 'y [asec]'
p
[26]:
../../_images/getting-started_tutorials_solar_flares_66_0.svg

2.3 Correct for Detector Efficiency#

In this tutorial, we assume that the instrument consists of a 3x3 grid of detectors which each record x-rays from distinct directions. (The reality is more complicated, of course.) Furthermore, the tutorial data has been manipulated to simulate different efficiencies of the individual detectors.

The efficiency is available in the input data group:

[27]:
efficiency = flare_datagroup['detector_efficiency']
efficiency
[27]:
Show/Hide data repr Show/Hide attributes
scipp.DataArray (1.38 KB)
    • x: 3
    • y: 3
    • x
      (x [bin-edge])
      float64
      4.84813681109535984e-06rad
      -1000.000, -300.0, 300.0, 1000.000
      Values:
      array([-1000., -300., 300., 1000.])
    • y
      (y [bin-edge])
      float64
      4.84813681109535984e-06rad
      -600.0, -200.0, 200.0, 600.0
      Values:
      array([-600., -200., 200., 600.])
    • (x, y)
      float64
      𝟙
      0.9, 0.95, ..., 0.94, 0.91
      Values:
      array([[0.9 , 0.95, 0.77], [0.98, 0.89, 0.82], [0.93, 0.94, 0.91]])

Normalize the data by dividing by efficiency. You first need to bin into the correct bins as defined by the coordinates of efficiency.

[28]:
# YOUR CODE

Solution#

[29]:
coarse_spatial = spatial.bin(x=efficiency.coords['x'], y=efficiency.coords['y'])
corrected = coarse_spatial / efficiency

You can plot the corrected data as before. But it makes sense to return to smaller bins in order to resolve the distribution properly. This can be done using scipp.bin with finer edges because the data was only binned and not histogrammed.

[30]:
# YOUR CODE

Solution#

Bin in the same way as before but using corrected as input

[31]:
spatial = corrected.bin(
    y=sc.linspace('y', -600, 600, 90, unit='asec'),
    x=sc.linspace('x', -1000, 1000, 150, unit='asec'),
)

And plot like before.

[32]:
p = spatial.hist().plot(aspect='equal', norm='log', cmap='inferno')
p.canvas.xlabel = 'x [asec]'
p.canvas.ylabel = 'y [asec]'
p
[32]:
../../_images/getting-started_tutorials_solar_flares_82_0.svg

3 Time Series#

Now, we want to look at the distribution of flares over time. The temporal distribution can be obtained like the spatial one using binning.

3.1 Bin on time coordinate#

Select one of the time coordinates (e.g. ‘peak_time’) and bin with an appropriate bin size. Plot the result.

Important: Use the pre-binned data from the previous task in order to include the detector normalization.

Tip:

  • The time coordinates are event-coordinates, that is, they are not defined at the top level of the binned data (i.e. per bin) but inside of the bins (i.e. per event). You can access them using <binned>.bins.coords['peak_time'].

  • You can add binning by time using <binned>.bin(peak_time=<n_bins>) and produce a three-dimensional array. But here, we are more interested in a one-dimensional distribution. So we need to erase the binning in x and y which can be achieved, e.g., using <binned>.bins.concat('x').

    Alternatively, the lower level function scipp.binning.make_binned can be used to bin in time and erase spatial binning at the same time.

[33]:
# YOUR CODE

Solution#

Here we keep the existing binning and plot time slices of the distribution of the events using the plopp.slicer tool. The slicer returns two widgets, a figure and a slider, to change properties of the figure we access the widget at index 0. Notice that the color scale is kept fixed by setting vmin and vmax keyword arguments in the call to plot.

[34]:
%matplotlib widget
import plopp as pp

temporal_and_spatial = spatial.bin(peak_time=10)
p = pp.slicer(
    temporal_and_spatial.hist(),
    keep=['y', 'x'],
    vmin=sc.scalar(1),
    autoscale='fixed',
    aspect='equal',
    cmap='magma',
    norm='log',
)
p.canvas.xlabel = 'x [asec]'
p.canvas.ylabel = 'y [asec]'
p
[34]:

Or you can remove the existing binning and bin in time to yield a 1D-variable representing the number of flares in each time bin.

[35]:
temporal = spatial.bins.concat('x').bins.concat('y').bin(peak_time=200)
temporal.hist().plot()
[35]:

Note that it is important to first remove the binning in x and y by using concat. Binning in time first and then removing other binning would use too much memory. It is possible to do everything in one step using a lower-level API:

[36]:
from scipp.binning import make_binned

time = spatial.bins.coords['peak_time']
min_time = time.min().value
max_time = time.max().value
step = (max_time - min_time) / 200
time_edges = sc.arange('peak_time', min_time, max_time, step, unit=time.bins.unit)
temporal = make_binned(spatial, edges=[time_edges], erase=('x', 'y'))
temporal.hist().plot()
[36]:

3.2 Flare Durations#

Another interesting quantity to look at is the duration of flares. The duration was computed earlier and should already be stored as an event coordinate.

Plot the duration as a function of time.

Tip:

  • Construct a new data array from the previous result: duration = <binned_by_time>.copy(). And assign new data using duration.bins.data = <duration_data>.

  • Previously, we used .hist() (which is an alias for .bins.sum()) to make histograms because the data was given as counts. Now the data is seconds which should be averaged instead of summed. So use .bins.mean() instead.

[37]:
# YOUR CODE

Solution#

Create a new data array containing the duration as its data.

[38]:
duration = temporal.copy()
duration.name = 'duration'
duration.bins.data = duration.bins.coords.pop('duration')

Plot duration vs time.

[39]:
duration.bins.mean().plot()
[39]:

4 Energy Bands#

The flares were recorded in several non-overlapping energy bands. Those are identified by the ‘min_energy’ and ‘max_energy’ coordinates in flares. Since the bands do not overlap, it is sufficient to label them with ‘min_energy’ for simplicity.

4.1 Group by energy content#

In this section, we want to split the temporal distribution from above into the different energy bands. Group the temporal distribution by ‘min_energy’ to obtain two-dimensional data. Use scipp.group instead of scipp.bin this time because every event has exactly one of a set of possible energies instead of a value in a range.

[40]:
# YOUR CODE

Solution#

Skipping the lowest energy band here because those events are not confirmed to be solar flares. This is optional.

[41]:
grouped_by_energy = temporal.group('min_energy')['min_energy', 1:]

Plot the result. A 2D plot is not very useful here, so split the data by ‘min_energy’ and either plot each energy in a separate plot or combine them into a dictionary and plot that:

lines = {'<name>': <data_array>}
sc.plot(lines)
[42]:
# YOUR CODE

Solution#

[43]:
lines = {
    f"min_energy={energy.value} {energy.unit}": grouped_by_energy[
        'min_energy', energy
    ].bins.sum()
    for energy in grouped_by_energy.coords['min_energy']
}
sc.plot(lines)
[43]:

References#

Lin, R., Dennis, B., Hurford, G. et al. The Reuven Ramaty High-Energy Solar Spectroscopic Imager (RHESSI). Sol Phys 210, 3–32 (2002). doi:10.1023/A:1022428818870