Workflow Design

Context

This document discusses the problems and requirements in data reduction for neutron-scattering experiments that have led to the design of Sciline. The terminology used in the examples is specific to neutron-scattering experiments, but the concepts are general and can be applied to other domains.

Introduction

Traditional data-reduction workflows

Traditionally, we have supplied users with a toolbox of algorithms and optionally a reduction script or a notebook that uses those algorithms. Conceptually this looks similar to the following:

# Define parameters
sample_run_id = 12345
background_run_id = 12300
direct_beam_filename = 'direct_beam.h5'
wav_bins = linspace(...)
Q_bins = linspace(...)

# Load data
sample = load_sample(run=sample_run_id)
background = load_background(run=background_run_id)
direct_beam = load_direct_beam(filename=direct_beam_filename)

# Process
mask_detectors(sample)
mask_detectors(background)
sample_monitors = preprocess_monitors(sample, wav_bins)
background_monitors = preprocess_monitors(background, wav_bins)

sample_transmission = transmission_fraction(**sample_monitors)
sample_iofq = compute_i_of_q(
    sample,
    direct_beam,
    sample_transmission,
    Q_bins)

background_transmission = transmission_fraction(**background_monitors)
background_iofq = compute_i_of_q(
    background,
    direct_beam,
    background_transmission,
    Q_bins)

iofq = subtract_background(sample_iofq, background_iofq)

This is an imperative workflow, where the user specifies the order of operations and the dependencies between them. This is not ideal for a number of reasons:

  • The user has to know the order of operations and the dependencies between them.

  • The user has to know which algorithms to use.

  • The user has to know which parameters to use for each algorithm.

  • The user has to know which data to use for each algorithm.

  • The user can easily introduce mistakes into a workflow, e.g., by using the wrong order of operations, or by overwriting data. This is especially problematic in Jupyter notebooks, where the user can easily run cells out of order.

Our most basic programming models provide little help to the user. For example, we typically write components of reduction workflows as functions operating on scipp.Variable or scipp.DataArray objects:

import scipp as sc


def transmission_fraction(
    incident_monitor: sc.DataArray,
    transmission_monitor: sc.DataArray,
) -> sc.DataArray:
    """
    Compute transmission fraction from incident and transmission monitors.

    Parameters
    ----------
    incident_monitor:
        Incident monitor.
    transmission_monitor:
        Transmission monitor.
    """
    return transmission_monitor / incident_monitor

Here, we rely on naming of function parameters as well as docstrings to convey the meaning of the parameters, and it is up to the user to pass the correct inputs. While good practices such as keyword-only arguments can help, this is still far from a scalable and maintainable solution.

As an improvement, we could adopt an approach with more specific domain types, e.g.,

def transmission_fraction(
    incident_monitor: IncidentMonitor,
    transmission_monitor: TransmissionMonitor,
) -> TransmissionFraction:
    return TransmissionFraction(transmission_monitor / incident_monitor)

We could now run mypy on reduction scripts to ensure that the correct types are passed to each function. However, this is not practical with dynamic workflows, i.e., when users modify workflows in a Jupyter notebook on the fly. Aside from this, such an approach would still not help with several of the other issues listed above.
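For illustration, consider a hypothetical misuse of the function above, with the two monitor arguments swapped. Assuming IncidentMonitor and TransmissionMonitor are distinct types (e.g., created with typing.NewType, as discussed below), mypy would flag both arguments:

fraction = transmission_fraction(
    incident_monitor=transmission_monitor,  # error: expected IncidentMonitor
    transmission_monitor=incident_monitor,  # error: expected TransmissionMonitor
)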

High-level summary of proposed approach

We propose an architecture combining domain-driven design with dependency injection. Dependency injection aids in building a declarative workflow. We define domain-specific concepts that are meaningful to the (instrument) scientist. Simple functions provide workflow components that define relations between these domain concepts.

Concretely, we propose to define specific domain types, such as IncidentMonitor, TransmissionMonitor, and TransmissionFraction in the example above. However, instead of the user having to pass these to functions, we use dependency injection to provide them to the functions. In essence this will build a workflow’s task graph.

Domain-driven design

Domain-Driven Design (DDD) is an approach to software development that aims to make software more closely match the domain it is used in. The obvious benefit of this is that it makes it easier for domain experts to understand and modify the software.

How should we define the domain for the purpose of data reduction? Looking at, e.g., Mantid, we see that the domain is defined as “data reduction for any type of neutron scattering experiment”. This has led to more than 1000 algorithms, making it hard for users to find and use the right ones. Furthermore, while the algorithms provide some sort of domain-specific language, the data types are generic.

What we propose here is to define the domain more narrowly, highly specific to a technique or even specific to an instrument or experiment. This will reduce the scope to cover in the domain-specific language. By making data types specific to the domain, we provide nouns for the domain-specific language.

Dependency injection

Dependency injection is a common technique for implementing the inversion of control principle. It makes components of a system more loosely coupled, and makes it easier to replace components, including for testing purposes. Dependency injection can be performed manually, but there are also frameworks that can help with this.

From the Guice documentation (Guice is a dependency injection framework for Java):

“This is the essence of dependency injection. If you need something, you don’t go out and get it from somewhere, or even ask a class to return you something. Instead, you simply declare that you can’t do your work without it, and rely on Guice to give you what you need.

This model is backwards from how most people think about code: it’s a more declarative model rather than an imperative one. This is why dependency injection is often described as a kind of inversion of control (IoC).” (emphasis added)
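To illustrate the mechanism, a minimal type-hint-based injector can be written in a few lines of Python. The following is only a sketch of the idea, not the actual implementation:

import typing
from typing import Any, Callable, Dict, List, Optional


def resolve(tp: Any, providers: List[Callable], results: Optional[Dict] = None) -> Any:
    """Compute an instance of `tp` by recursively resolving provider dependencies."""
    results = {} if results is None else results
    if tp in results:  # already computed, or supplied up front as a parameter
        return results[tp]
    # Find the provider whose return annotation matches the requested type.
    provider = next(
        p for p in providers if typing.get_type_hints(p).get('return') == tp
    )
    hints = typing.get_type_hints(provider)
    del hints['return']
    # Recursively resolve the provider's arguments, then call it.
    args = {name: resolve(t, providers, results) for name, t in hints.items()}
    results[tp] = provider(**args)
    return results[tp]

Given the transmission_fraction provider from above, together with providers for IncidentMonitor and TransmissionMonitor (or those values seeded into results), resolve(TransmissionFraction, providers) assembles and evaluates the dependency graph.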

Architecture

In a nutshell

  1. The user will define building blocks of a workflow using highly specific domain types for the type-hints, such as IncidentMonitor, TransmissionMonitor, and TransmissionFraction, e.g.,

    def transmission_fraction(
        incident_monitor: IncidentMonitor,
        transmission_monitor: TransmissionMonitor,
    ) -> TransmissionFraction:
        return TransmissionFraction(transmission_monitor / incident_monitor)
    
  2. The user passes a set of building blocks to the system, which assembles a dependency graph based on the type-hints.

  3. The user requests a specific output from the system using one of the domain types. This may be computed directly, or the system may construct a Dask graph to compute the output. A sketch of these steps is shown after this list.

Depending on the level of expertise of the user and the level of control they need, step 1, or steps 1 and 2, may be omitted, since pre-defined building blocks and sets of building blocks can be provided in domain-specific support libraries for common use cases.
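For illustration, the user-facing side of steps 2 and 3 might look as follows. This is a sketch only; Pipeline, compute, and get, as well as the provider and type names, are illustrative rather than a fixed API:

providers = [load_monitors, transmission_fraction, compute_i_of_q]
pipeline = Pipeline(providers, params={WavelengthBinning: wav_bins, QBins: q_bins})
iofq = pipeline.compute(IofQ)  # compute the requested output directly ...
graph = pipeline.get(IofQ)     # ... or build a task graph, e.g., for Dask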

Parameter handling

Generally, the user must provide configuration parameters to a workflow. In many cases there are defaults that can be used. In either case, these parameters must be associated with the correct step in the workflow. This is complicated by the non-linear nature of the workflow. Traditionally, a flat list of parameters has been used, relying entirely on parameter naming. This is problematic for two reasons: First, certain basic workflow steps may be used in multiple places. Second, workflows frequently contain nested steps, which may or may not share the same parameters. This makes the process of setting parameters somewhat opaque and error-prone. Furthermore, it relies on a hand-written higher-level workflow to set parameters for nested steps, mapping between globally unique parameter names and the parameters of the nested steps. This mapping, in turn, requires complicated testing.

A hierarchical parameter system could provide an alternative, but makes it harder to set “global” parameters. For example, we may want to use the same wavelength-binning for all steps in the workflow.

We propose to handle parameters as dependencies of workflow steps. That is, the dependency-injection system is used to provide parameters to workflow steps. Parameters are identified via their type, i.e., we will require defining a domain-specific type for each parameter, such as WavelengthBinning.
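For example, a parameter type and a provider consuming it could look as follows. This is a sketch; the RawMonitor and WavelengthMonitor types and the provider are illustrative:

import typing
import scipp as sc

WavelengthBinning = typing.NewType('WavelengthBinning', sc.Variable)
RawMonitor = typing.NewType('RawMonitor', sc.DataArray)
WavelengthMonitor = typing.NewType('WavelengthMonitor', sc.DataArray)


def preprocess_monitor(
    monitor: RawMonitor, binning: WavelengthBinning
) -> WavelengthMonitor:
    # The binning is injected by type, just like any other dependency.
    return WavelengthMonitor(monitor.hist(wavelength=binning))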

For nested workflows, we can use a child injector, which provides a scope for parameters. Parent-scopes can be searched for parameters that are not found in the child-scope, providing a mechanism for “global” parameters.

Note: The idea of using child injectors was discarded during implementation for a number of reasons. Parameters in a “nested” scope can now be realized using the mechanism of generic providers.
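A sketch of the generic-provider idea (the concrete mechanism in the implementation may differ): a provider parametrized by a type variable replaces several near-identical providers, one per run type. All names below are illustrative:

import typing

SampleRun = typing.NewType('SampleRun', int)
BackgroundRun = typing.NewType('BackgroundRun', int)
RunType = typing.TypeVar('RunType', SampleRun, BackgroundRun)


class RawData(typing.Generic[RunType]):
    """Hypothetical generic domain type, parametrized by the run type."""


class MaskedData(typing.Generic[RunType]):
    """Hypothetical generic domain type, parametrized by the run type."""


def mask_detectors(data: RawData[RunType]) -> MaskedData[RunType]:
    # One definition serves both the sample and the background branch;
    # the framework instantiates it once per run type.
    ...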

Metadata handling

There have been a number of discussions around metadata handling, for example whether to support an arbitrary attrs dict as part of scipp.Variable and scipp.DataArray. Furthermore, we may have metadata that is part of the data catalog, which may partially overlap with the metadata that is part of the data itself. The current conclusion is that any attempt to handle metadata in a generic and automatic way will not work. Therefore, if a user wants to provide metadata for a workflow result, they must do so explicitly, by specifying functions that can assemble that metadata. As with regular results, this can be done by injecting the input metadata into the function that computes the result’s metadata.
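For example (a sketch; the metadata types are hypothetical, and WavelengthBinning reuses the type from the parameter-handling sketch above):

import typing

SampleMetadata = typing.NewType('SampleMetadata', dict)
IofQMetadata = typing.NewType('IofQMetadata', dict)


def iofq_metadata(
    sample_meta: SampleMetadata, binning: WavelengthBinning
) -> IofQMetadata:
    # Metadata is assembled by an ordinary provider, so the injection
    # mechanism handles it like any other workflow step.
    return IofQMetadata({'sample': sample_meta, 'wavelength_binning': str(binning)})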

Domain-specific types

The system will use domain-specific types to identify workflow steps and parameters. The system does not require subclassing or use of decorators. Domain-specific types can be defined as regular classes or subclasses of existing types. In many cases, a simple alias will be sufficient.

typing.NewType can be used as a simple way of creating a domain-specific type alias for type-checking. For example, we can use it to create a type for a scipp.DataArray that represents a transmission monitor. This avoids a more complex solution such as creating a wrapper class or a subclass:

import typing
import scipp

TransmissionMonitor = typing.NewType('TransmissionMonitor', scipp.DataArray)

Note that this does not create a new class, but rather a new name for an existing type. That is, isinstance(monitor, TransmissionMonitor) does not work, since TransmissionMonitor is not an actual class. Furthermore, operations fall back to the underlying type, e.g., monitor * 2 returns a plain scipp.DataArray. For this application this is actually desired: applying an operation to a domain type will generally result in a different type, so falling back to the underlying type is correct and forces the user to be explicit about the type of the result.
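For example (CorrectedMonitor is a hypothetical domain type, introduced only for this demonstration):

import typing
import scipp

TransmissionMonitor = typing.NewType('TransmissionMonitor', scipp.DataArray)
CorrectedMonitor = typing.NewType('CorrectedMonitor', scipp.DataArray)

monitor = TransmissionMonitor(
    scipp.DataArray(scipp.array(dims=['wavelength'], values=[1.0, 2.0]))
)
scaled = monitor * 2  # a plain scipp.DataArray, no longer a TransmissionMonitor
corrected = CorrectedMonitor(monitor * 2)  # the result type is stated explicitly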

Notes

  • Earlier versions of this design document included a detailed discussion on how to handle nested workflows using child-injectors. During initial implementation efforts this idea was discarded, as it was found to be too complicated and not necessary.

  • Earlier versions of this design document included considerations on the use of the injector Python library. During early implementation efforts it was found that this library did not provide advantages beyond its most basic features (which can be implemented in a few lines of code). For example, we attempted to use injector’s multibind and child-injector mechanisms to handle nested workflows as well as multiple runs, but this turned out to be far too complicated. A dedicated home-grown approach, in particular generic providers and parameter tables, proved more straightforward and furthermore avoided introducing a dependency.