Audience: Advanced beginner / intermediate (requires basic knowledge of scipp.DataArray
)
Objectives: Constructing binned data from an event list, modifying binning in order to extract different quantities, and basic masking and filtering of event and binned data.
RHESSI Solar Flares#
This tutorial covers the basics of binned data in Scipp by analyzing the list of solar flares recorded by NASA’s RHESSI small explorer [Lin et al].
The input data has been constructed from the official flare list. It is available as a HDF5 file in Scipp’s own format and can be downloaded and accessed directly via scipp.data
as shown below.
Attention
The tutorial data has been filtered and modified. It should not be used for any actual scientific analyses!
See docs/getting-started/tutorials/prepare_data_rhessi.py
in the scipp source for details.
1 Loading and preprocessing data#
[1]:
import scipp as sc
1.1 Load Flare List#
The file contains a number of items that are loaded together as a data group. - flares
: The actual flare list. - non_solar
: Indicates whether a flare event actually originated from a source that is not the Sun. - detector_efficiency
: Relative efficiencies of the detector components. This is faked for the purposes of this tutorial! The official flare list is already corrected for such effects.
[2]:
filename = sc.data.rhessi_flares()
flare_datagroup = sc.io.load_hdf5(filename)
flare_datagroup
Downloading file 'rhessi_flares.h5' from 'https://public.esss.dk/groups/scipp/scipp/2/rhessi_flares.h5' to '/home/runner/.cache/scipp/2'.
[2]:
- flaresscippDataArray(flare: 70253)float64𝟙1.0, 1.0, ..., 1.0, 1.0
- non_solarscippVariable(flare: 70253)boolFalse, False, ..., False, False
- detector_efficiencyscippDataArray(x: 3, y: 3)float64𝟙0.9, 0.95, ..., 0.94, 0.91
- citationstr()https://doi.org/10.1023/A:1022428818870
- descriptionstr()X-ray flares recorded by NASA's Reuven Ramaty High Energy Solar Spectroscopic Im...
- urlstr()https://hesperia.gsfc.nasa.gov/rhessi3/data-access/rhessi-data/flare-list/index....
For ease of use, we extract the flares
data array. Its data (flares.data
) contains a weight for each flare. Initially, all weights are 1.
Its coordinates have the following meanings:
start_time
,end_time
: Time interval of flare.peak_time
: Date and time of the highest x-ray flux.x
,y
: Position in the image seen by RHESSI.min_energy
,max_energy
: Energy band that a flare was observed in. Bands do not overlap.
[3]:
flares = flare_datagroup['flares']
flares
[3]:
- flare: 70253
- end_time(flare)datetime64s2002-02-12T20:48:56, 2002-02-13T06:07:48, ..., 2018-02-09T15:43:32, 2018-02-09T17:17:40
Values:
array(['2002-02-12T20:48:56', '2002-02-13T06:07:48', '2002-02-13T09:04:44', ..., '2018-02-09T13:58:32', '2018-02-09T15:43:32', '2018-02-09T17:17:40'], dtype='datetime64[s]') - max_energy(flare)float64keV12.0, 50.0, ..., 12.0, 12.0
Values:
array([12., 50., 12., ..., 12., 12., 12.]) - min_energy(flare)float64keV6.0, 25.0, ..., 6.0, 6.0
Values:
array([ 6., 25., 6., ..., 6., 6., 6.]) - peak_time(flare)datetime64s2002-02-12T20:45:06, 2002-02-13T06:05:14, ..., 2018-02-09T15:42:54, 2018-02-09T17:17:26
Values:
array(['2002-02-12T20:45:06', '2002-02-13T06:05:14', '2002-02-13T09:04:42', ..., '2018-02-09T13:55:22', '2018-02-09T15:42:54', '2018-02-09T17:17:26'], dtype='datetime64[s]') - start_time(flare)datetime64s2002-02-12T20:44:08, 2002-02-13T06:03:52, ..., 2018-02-09T15:41:28, 2018-02-09T17:15:56
Values:
array(['2002-02-12T20:44:08', '2002-02-13T06:03:52', '2002-02-13T09:02:56', ..., '2018-02-09T13:53:32', '2018-02-09T15:41:28', '2018-02-09T17:15:56'], dtype='datetime64[s]') - x(flare)float644.84813681109535984e-06rad604.0, -272.0, ..., -345.0, -268.0
Values:
array([ 604., -272., -235., ..., -345., -345., -268.]) - y(flare)float644.84813681109535984e-06rad-341.0, 390.0, ..., -38.0, -38.0
Values:
array([-341., 390., 390., ..., -38., -38., -38.])
- (flare)float64𝟙1.0, 1.0, ..., 1.0, 1.0
Values:
array([1., 1., 1., ..., 1., 1., 1.])
1.2 Inspect the Loaded Data#
Begin by inspecting the data group and flares data array to gain a basic understanding of the data. This task is open-ended, and you can continue when you feel confident that you know what flares
(and flare_datagroup
) contains.
Possible actions:
Use scipp.show and scipp.table in addition to the HTML output of the cell above.
Inspect individual coordinates using
flares.coords['<name>']
.
For guidance, you can answer the following questions. Or find your own.
How many flares are there in the dataset?
How many flares are flagged as ‘non_solar’?
What is the time range of the data?
Tip: Use scipp.sum, scipp.max, and scipp.min with the coordinates of flares
and items of flare_datagroup
.
[4]:
# YOUR CODE
Solution#
[5]:
# Number of flares
flares.sizes['flare']
[5]:
70253
[6]:
# Number of flares flagged as non-solar
flare_datagroup['non_solar'].sum()
[6]:
- ()int64253
Values:
array(253)
[7]:
# Start of time range
flares.coords['start_time'].min()
[7]:
- ()datetime64s2002-02-12T20:44:08
Values:
array('2002-02-12T20:44:08', dtype='datetime64[s]')
[8]:
# End of time range
flares.coords['end_time'].max()
[8]:
- ()datetime64s2018-02-09T17:17:40
Values:
array('2018-02-09T17:17:40', dtype='datetime64[s]')
1.3 Compute Duration#
Calculate the duration of flares as end_time - start_time
and store the result as a new coordinate in flares
. Remember that flares.coords
functions like a Python dict
.
[9]:
# YOUR CODE
Solution#
Simple:
[10]:
duration = flares.coords['end_time'] - flares.coords['start_time']
flares.coords['duration'] = duration
Advanced, using a coordinate transformation:
[11]:
def compute_duration(start_time, end_time):
return end_time - start_time
flares = flares.transform_coords('duration', {'duration': compute_duration})
[12]:
flares
[12]:
- flare: 70253
- duration(flare)int64s288, 236, ..., 124, 104
Values:
array([288, 236, 108, ..., 300, 124, 104]) - end_time(flare)datetime64s2002-02-12T20:48:56, 2002-02-13T06:07:48, ..., 2018-02-09T15:43:32, 2018-02-09T17:17:40
Values:
array(['2002-02-12T20:48:56', '2002-02-13T06:07:48', '2002-02-13T09:04:44', ..., '2018-02-09T13:58:32', '2018-02-09T15:43:32', '2018-02-09T17:17:40'], dtype='datetime64[s]') - max_energy(flare)float64keV12.0, 50.0, ..., 12.0, 12.0
Values:
array([12., 50., 12., ..., 12., 12., 12.]) - min_energy(flare)float64keV6.0, 25.0, ..., 6.0, 6.0
Values:
array([ 6., 25., 6., ..., 6., 6., 6.]) - peak_time(flare)datetime64s2002-02-12T20:45:06, 2002-02-13T06:05:14, ..., 2018-02-09T15:42:54, 2018-02-09T17:17:26
Values:
array(['2002-02-12T20:45:06', '2002-02-13T06:05:14', '2002-02-13T09:04:42', ..., '2018-02-09T13:55:22', '2018-02-09T15:42:54', '2018-02-09T17:17:26'], dtype='datetime64[s]') - start_time(flare)datetime64s2002-02-12T20:44:08, 2002-02-13T06:03:52, ..., 2018-02-09T15:41:28, 2018-02-09T17:15:56
Values:
array(['2002-02-12T20:44:08', '2002-02-13T06:03:52', '2002-02-13T09:02:56', ..., '2018-02-09T13:53:32', '2018-02-09T15:41:28', '2018-02-09T17:15:56'], dtype='datetime64[s]') - x(flare)float644.84813681109535984e-06rad604.0, -272.0, ..., -345.0, -268.0
Values:
array([ 604., -272., -235., ..., -345., -345., -268.]) - y(flare)float644.84813681109535984e-06rad-341.0, 390.0, ..., -38.0, -38.0
Values:
array([-341., 390., 390., ..., -38., -38., -38.])
- (flare)float64𝟙1.0, 1.0, ..., 1.0, 1.0
Values:
array([1., 1., 1., ..., 1., 1., 1.])
What is the combined duration of flares? (Find an appropriate function in https://scipp.github.io/reference/free-functions.html#reduction)
[13]:
# YOUR CODE
Solution#
[14]:
sc.sum(duration).to(unit='D', dtype='float64')
[14]:
- ()float64D375.4030555555555
Values:
array(375.40305556)
1.4 Create Masks#
Some events in the input data did not originate from the Sun.
There are two options for handling those events, removing them or masking them. You can choose a solution, but the descriptions guide you through masking, which is a method for removing events non-destructively. That is, the masks can be removed later to get the full event list back in order to determine the impact of the masks.
Non-solar mask#
First, store 'non_solar'
as a mask. Use flare_datagroup
and flares.masks
which are dict
-like objects (similar to flares.coords
).
[15]:
# YOUR CODE
Solution#
[16]:
flares.masks['non_solar'] = flare_datagroup['non_solar']
Unknown position mask#
Second, there are some flares whose positions could not be determined. Those are stored with x == y == 0
and need to be removed, as well.
Construct a boolean variable by comparing flares.coords['<x_or_y>']
to 0. Note that x
and y
have unit ‘asec’. This means that you have to construct a ‘0’ with the same unit which can be done using 0 * sc.Unit('asec')
.
Finally, combine the variables for x
and y
using mask_x & mask_y
and store the result as a new mask in flares
.
[17]:
# YOUR CODE
Solution#
[18]:
flares.masks['unknown_position'] = (flares.coords['x'] == 0 * sc.Unit('asec')) & (
flares.coords['y'] == 0 * sc.Unit('asec')
)
How many flares are now masked? (By each mask individually and by the combination.)
Tip: You can sum
booleans.
[19]:
# YOUR CODE
Solution#
[20]:
ns_mask = flares.masks['non_solar']
pos_mask = flares.masks['unknown_position']
{
'non_solar': ns_mask.sum().value,
'unknown_position': pos_mask.sum().value,
'combined': sc.sum(ns_mask | pos_mask).value,
}
[20]:
{'non_solar': np.int64(253),
'unknown_position': np.int64(703),
'combined': np.int64(703)}
2 Spatial Distribution#
2.1 Bin by x and y#
Plot the spatial distribution of flares.
Plotting the event list flares
would yield a scatter plot which is not particularly useful. A better approach is computing and plotting the density as a function of x
and y
. This is commonly done by histogramming the events. But Scipp offers an alternative: binning.
Scipp’s ‘binned data’ is similar to a histogram, except that the individual events are preserved. They are simply collected into bins as defined by bin-edge coordinates.
Define bin-edges for x
and y
, use scipp.bin to create binned data from flares
, and plot the result.
Tip:
Use scipp.arange or scipp.linspace to construct the edges. Make sure to use the correct unit!
Use
sc.bin(flares, edges=[<edge_x>, <edge_y>])
.Turn your binned data into a histogram before plotting using
<binned>.bins.sum()
.See the Plopp documentation for the relevant parts of the plotting API.
[21]:
# YOUR CODE
Solution#
Bin spatially.
[22]:
spatial = flares.bin(
y=sc.linspace('y', -1200, 1200, 100, unit='asec'),
x=sc.linspace('x', -1200, 1200, 100, unit='asec'),
)
Histogram and plot. The arguments to plot
are optional but improve the result here.
[23]:
p = spatial.hist().plot(aspect='equal', norm='log', cmap='inferno')
p.canvas.xlabel = 'x [asec]'
p.canvas.ylabel = 'y [asec]'
p
[23]:
2.2 Remove Outliers#
The plot shows a lot of outliers. They are not the non-solar events from before because those are not visible in the plot as they were masked out. Instead, the outliers are caused by RHESSI’s electronics or analysis software glitching out and assigning bad positions to the flares. This includes the circular shape even though it looks deceptively Sun-like.
The flare list does not include enough information to exclude all such bad positions. But the instrument can only detect x-rays for \(x \in [-1000~\text{asec}, 1000~\text{asec}]\) and \(y \in [-600~\text{asec}, 600~\text{asec}]\). So to first order, all events outside that range should be removed.
Instead of using a mask as before, use binning this time. Create new bin-edges for x
and y
with the proper limits and bin the data with them, this will remove all events outside the valid range. Plot the result. (You can either bin the original flares
data array or apply new bins to the previously binned array; both via scipp.bin).
[24]:
# YOUR CODE
Solution#
Bin into a more narrow range. Bin sizes are chosen such that bins are square.
[25]:
spatial = flares.bin(
y=sc.linspace('y', -600, 600, 90, unit='asec'),
x=sc.linspace('x', -1000, 1000, 150, unit='asec'),
)
And plot like before.
[26]:
p = spatial.hist().plot(aspect='equal', norm='log', cmap='inferno')
p.canvas.xlabel = 'x [asec]'
p.canvas.ylabel = 'y [asec]'
p
[26]:
2.3 Correct for Detector Efficiency#
In this tutorial, we assume that the instrument consists of a 3x3 grid of detectors which each record x-rays from distinct directions. (The reality is more complicated, of course. See the wiki for more information.) Furthermore, the tutorial data has been manipulated to simulate different efficiencies of the individual detectors.
The efficiency is available in the input data group:
[27]:
efficiency = flare_datagroup['detector_efficiency']
efficiency
[27]:
- x: 3
- y: 3
- x(x [bin-edge])float644.84813681109535984e-06rad-1000.000, -300.0, 300.0, 1000.000
Values:
array([-1000., -300., 300., 1000.]) - y(y [bin-edge])float644.84813681109535984e-06rad-600.0, -200.0, 200.0, 600.0
Values:
array([-600., -200., 200., 600.])
- (x, y)float64𝟙0.9, 0.95, ..., 0.94, 0.91
Values:
array([[0.9 , 0.95, 0.77], [0.98, 0.89, 0.82], [0.93, 0.94, 0.91]])
Normalize the data by dividing by efficiency
. You first need to bin into the correct bins as defined by the coordinates of efficiency
.
[28]:
# YOUR CODE
Solution#
[29]:
coarse_spatial = spatial.bin(x=efficiency.coords['x'], y=efficiency.coords['y'])
corrected = coarse_spatial / efficiency
You can plot the corrected data as before. But it makes sense to return to smaller bins in order to resolve the distribution properly. This can be done using scipp.bin with finer edges because the data was only binned and not histogrammed.
[30]:
# YOUR CODE
Solution#
Bin in the same way as before but using corrected
as input
[31]:
spatial = corrected.bin(
y=sc.linspace('y', -600, 600, 90, unit='asec'),
x=sc.linspace('x', -1000, 1000, 150, unit='asec'),
)
And plot like before.
[32]:
p = spatial.hist().plot(aspect='equal', norm='log', cmap='inferno')
p.canvas.xlabel = 'x [asec]'
p.canvas.ylabel = 'y [asec]'
p
[32]:
3 Time Series#
Now, we want to look at the distribution of flares over time. The temporal distribution can be obtained like the spatial one using binning.
3.1 Bin on time coordinate#
Select one of the time coordinates (e.g. ‘peak_time’) and bin with an appropriate bin size. Plot the result.
Important: Use the pre-binned data from the previous task in order to include the detector normalization.
Tip:
The time coordinates are event-coordinates, that is, they are not defined at the top level of the binned data (i.e. per bin) but inside of the bins (i.e. per event). You can access them using
<binned>.bins.coords['peak_time']
.You can add binning by time using
<binned>.bin(peak_time=<n_bins>)
and produce a three-dimensional array. But here, we are more interested in a one-dimensional distribution. So we need to erase the binning inx
andy
which can be achieved, e.g., using<binned>.bins.concat('x')
.Alternatively, the lower level function scipp.binning.make_binned can be used to bin in time and erase spatial binning at the same time.
[33]:
# YOUR CODE
Solution#
Here we keep the existing binning and plot time slices of the distribution of the events using the plopp.slicer
tool. The slicer returns two widgets, a figure and a slider, to change properties of the figure we access the widget at index 0. Notice that the color scale is kept fixed by setting vmin
and vmax
keyword arguments in the call to plot.
[34]:
%matplotlib widget
import plopp as pp
temporal_and_spatial = spatial.bin(peak_time=10)
p = pp.slicer(
temporal_and_spatial.hist(),
keep=['y', 'x'],
vmin=sc.scalar(1),
autoscale='fixed',
aspect='equal',
cmap='magma',
norm='log',
)
p.canvas.xlabel = 'x [asec]'
p.canvas.ylabel = 'y [asec]'
p
[34]:
Or you can remove the existing binning and bin in time to yield a 1D-variable representing the number of flares in each time bin.
[35]:
temporal = spatial.bins.concat('x').bins.concat('y').bin(peak_time=200)
temporal.hist().plot()
[35]:
Note that it is important to first remove the binning in x and y by using concat
. Binning in time first and then removing other binning would use too much memory. It is possible to do everything in one step using a lower-level API:
[36]:
from scipp.binning import make_binned
time = spatial.bins.coords['peak_time']
min_time = time.min().value
max_time = time.max().value
step = (max_time - min_time) / 200
time_edges = sc.arange('peak_time', min_time, max_time, step, unit=time.bins.unit)
temporal = make_binned(spatial, edges=[time_edges], erase=('x', 'y'))
temporal.hist().plot()
[36]:
3.2 Flare Durations#
Another interesting quantity to look at is the duration of flares. The duration was computed earlier and should already be stored as an event coordinate.
Plot the duration as a function of time.
Tip:
Construct a new data array from the previous result:
duration = <binned_by_time>.copy()
. And assign new data usingduration.bins.data = <duration_data>
.Previously, we used
.hist()
(which is an alias for.bins.sum()
) to make histograms because the data was given as counts. Now the data is seconds which should be averaged instead of summed. So use.bins.mean()
instead.
[37]:
# YOUR CODE
Solution#
Create a new data array containing the duration as its data
.
[38]:
duration = temporal.copy()
duration.name = 'duration'
duration.bins.data = duration.bins.coords.pop('duration')
Plot duration vs time.
[39]:
duration.bins.mean().plot()
[39]:
4 Energy Bands#
The flares were recorded in several non-overlapping energy bands. Those are identified by the ‘min_energy’ and ‘max_energy’ coordinates in flares
. Since the bands do not overlap, it is sufficient to label them with ‘min_energy’ for simplicity.
4.1 Group by energy content#
In this section, we want to split the temporal distribution from above into the different energy bands. Group the temporal distribution by ‘min_energy’ to obtain two-dimensional data. Use scipp.group instead of scipp.bin
this time because every event has exactly one of a set of possible energies instead of a value in a range.
[40]:
# YOUR CODE
Solution#
Skipping the lowest energy band here because those events are not confirmed to be solar flares. This is optional.
[41]:
grouped_by_energy = temporal.group('min_energy')['min_energy', 1:]
Plot the result. A 2D plot is not very useful here, so split the data by ‘min_energy’ and either plot each energy in a separate plot or combine them into a dictionary and plot that:
lines = {'<name>': <data_array>}
sc.plot(lines)
[42]:
# YOUR CODE
Solution#
[43]:
lines = {
f"min_energy={energy.value} {energy.unit}": grouped_by_energy[
'min_energy', energy
].bins.sum()
for energy in grouped_by_energy.coords['min_energy']
}
sc.plot(lines)
[43]:
References#
Lin, R., Dennis, B., Hurford, G. et al. The Reuven Ramaty High-Energy Solar Spectroscopic Imager (RHESSI). Sol Phys 210, 3–32 (2002). doi:10.1023/A:1022428818870