Generic Providers#
Overview#
Sometimes we may want to replicate parts of a workflow and apply it to a different input. In Parameter Tables we saw how we can use a parameter table for this purpose. Another approach is to use generics to define a generic provider.
While parameter tables are well suited for “homogeneous” tasks such as a set of images, generics can be a better choice for “heterogeneous” tasks where the computation steps are (nearly) identical, but the purpose of the computation is different. For example, we may have a “data” image and a “background” image (from a sensor background measurement). We want to perform some initial homogeneous operations on both, but ultimately reach a point where we want to subtract the background from the data.
Generally speaking, using Sciline with generics can have several advantages:
Intuitive syntax for requesting computation of intermediate results.
Specialized providers can be used for individual intermediate steps, whereas with parameter tables the providers are necessarily identical for every row.
Maintain reusability of providers and avoid synchronization points in workflows by combining parameter tables and generic providers.
Before moving on to the next sections, where we will elaborate on these points, consider how generic providers can be used directly in a pipeline. For example, we can set up a pipeline that computes a list of any type (provided that there is a parameter of that type) in a very compact manner:
[1]:
from typing import TypeVar

import sciline

T = TypeVar("T", int, float, str)


def duplicate(x: T) -> list[T]:
    """A generic provider that can make any list."""
    return [x, x]


pipeline = sciline.Pipeline([duplicate], params={int: 1, float: 2.0, str: "3"})
print(pipeline.compute(list[int]))
print(pipeline.compute(list[float]))
print(pipeline.compute(list[str]))
[1, 1]
[2.0, 2.0]
['3', '3']
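Because T is constrained to int, float, and str, a request for a list of any other type cannot be satisfied by this pipeline: there is neither a provider nor a parameter for it. The following is a minimal sketch of this; the exact exception class raised by Sciline is not shown in this example, so we catch broadly for illustration:

# Sketch: requesting a type outside the TypeVar constraints fails, because no
# provider or parameter can produce it. The precise exception class may vary
# between Sciline versions, so we catch broadly here.
try:
    pipeline.compute(list[bytes])
except Exception as error:
    print(f'Cannot compute list[bytes]: {error!r}')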
Counterexample: Naive approach#
Starting with the model workflow introduced in Getting Started, consider an extension where we also need to subtract a background signal from the data. Naively, we could extend the example as follows, which is verbose and error-prone due to the duplication:
[2]:
from typing import NewType

import sciline

_fake_filesystem = {
    'file102.txt': [1, 2, float('nan'), 3],
    'file103.txt': [1, 2, 3, 4],
    'file104.txt': [1, 2, 3, 4, 5],
    'file105.txt': [1, 2, 3],
    'background.txt': [0.1, 0.1],
}

# 1. Define domain types
Filename = NewType('Filename', str)
BackgroundFilename = NewType('BackgroundFilename', str)
RawData = NewType('RawData', dict)
RawBackground = NewType('RawBackground', dict)
CleanedData = NewType('CleanedData', list)
CleanedBackground = NewType('CleanedBackground', list)
ScaleFactor = NewType('ScaleFactor', float)
Background = NewType('Background', list)
BackgroundSubtractedData = NewType('BackgroundSubtractedData', list)
Result = NewType('Result', float)


# 2. Define providers
def _load(filename: str) -> dict:
    data = _fake_filesystem[filename]
    return {'data': data, 'meta': {'filename': filename}}


def load(filename: Filename) -> RawData:
    """Load the data from the filename."""
    return RawData(_load(filename))


def load_background(filename: BackgroundFilename) -> RawBackground:
    """Load the background from the filename."""
    return RawBackground(_load(filename))


def _clean(raw_data: dict) -> list:
    import math

    return [x for x in raw_data['data'] if not math.isnan(x)]


def clean(raw_data: RawData) -> CleanedData:
    """Clean the data, removing NaNs."""
    return CleanedData(_clean(raw_data))


def clean_background(raw_data: RawBackground) -> CleanedBackground:
    """Clean the background, removing NaNs."""
    return CleanedBackground(_clean(raw_data))


def subtract_background(
    data: CleanedData, background: CleanedBackground
) -> BackgroundSubtractedData:
    """Subtract the summed background from each data point."""
    return BackgroundSubtractedData([x - sum(background) for x in data])


def process(data: BackgroundSubtractedData, param: ScaleFactor) -> Result:
    """Process the data, multiplying the sum by the scale factor."""
    return Result(sum(data) * param)


# 3. Create pipeline
providers = [
    load,
    load_background,
    clean,
    clean_background,
    process,
    subtract_background,
]
params = {
    ScaleFactor: 2.0,
    Filename: 'file102.txt',
    BackgroundFilename: 'background.txt',
}
pipeline = sciline.Pipeline(providers, params=params)

print(f'Result={pipeline.compute(Result)}')
pipeline.visualize(Result)
Result=10.8
[2]:
We would like to reuse the load and clean functions for the background, without having to add wrappers and without duplicating all the involved domain types (BackgroundFilename, RawBackground, and CleanedBackground). However, we cannot do this directly, since Filename, RawData, and CleanedData are unique identifiers specific to the non-background files, and this uniqueness forms the foundation of how Sciline works.
Sciline seeks to address this conundrum by providing a mechanism for using generic providers and for defining generic type aliases, introduced in the next section.
Generic domain types and providers#
To avoid duplicates of domain types and providers, we may instead define generic domain types and generic providers. The example is then written as:
[3]:
from typing import NewType, TypeVar

import sciline

_fake_filesystem = {
    'file102.txt': [1, 2, float('nan'), 3],
    'file103.txt': [1, 2, 3, 4],
    'file104.txt': [1, 2, 3, 4, 5],
    'file105.txt': [1, 2, 3],
    'background.txt': [0.1, 0.1],
}

# 1. Define domain types
# 1.a Define concrete RunType values we will use
Sample = NewType('Sample', int)
Background = NewType('Background', int)

# 1.b Define generic domain types
RunType = TypeVar('RunType', Sample, Background)


class Filename(sciline.Scope[RunType, str], str): ...


class RawData(sciline.Scope[RunType, dict], dict): ...


class CleanedData(sciline.Scope[RunType, list], list): ...


# 1.c Define normal domain types
ScaleFactor = NewType('ScaleFactor', float)
BackgroundSubtractedData = NewType('BackgroundSubtractedData', list)
Result = NewType('Result', float)


# 2. Define providers
# 2.a Define generic providers
def load(filename: Filename[RunType]) -> RawData[RunType]:
    """Load the data from the filename."""
    data = _fake_filesystem[filename]
    return RawData[RunType]({'data': data, 'meta': {'filename': filename}})


def clean(raw_data: RawData[RunType]) -> CleanedData[RunType]:
    """Clean the data, removing NaNs."""
    import math

    return CleanedData[RunType]([x for x in raw_data['data'] if not math.isnan(x)])


# 2.b Define normal providers
def subtract_background(
    data: CleanedData[Sample], background: CleanedData[Background]
) -> BackgroundSubtractedData:
    """Subtract the summed background from each data point."""
    return BackgroundSubtractedData([x - sum(background) for x in data])


def process(data: BackgroundSubtractedData, param: ScaleFactor) -> Result:
    """Process the data, multiplying the sum by the scale factor."""
    return Result(sum(data) * param)


# 3. Create pipeline
providers = [load, clean, process, subtract_background]
params = {
    ScaleFactor: 2.0,
    Filename[Sample]: 'file102.txt',
    Filename[Background]: 'background.txt',
}
pipeline = sciline.Pipeline(providers, params=params)
Apart from updated type annotations for load and clean, the code is nearly identical to the original example without background subtraction.
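Since the single generic load provider now serves both branches of the graph, we can compute the raw data for either run type from the same provider. A minimal sketch using the pipeline defined above:

# Sketch: one generic provider handles both the Sample and Background branches.
raw_sample = pipeline.compute(RawData[Sample])
raw_background = pipeline.compute(RawData[Background])
print(raw_sample['meta'])      # {'filename': 'file102.txt'}
print(raw_background['meta'])  # {'filename': 'background.txt'}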
Note
Sciline requires type variables that are used as part of keys to have constraints. In the above example, we define a type variable RunType that is constrained to be either Sample or Background:
RunType = TypeVar('RunType', Sample, Background)
Note
We use a peculiar-looking syntax for defining “generic type aliases”. We would love to use typing.NewType for this, but it does not allow for definition of generic aliases. The syntax we use (subclassing sciline.Scope) is a workaround for defining generic aliases that work both at runtime and with mypy:
class Filename(sciline.Scope[RunType, str], str):
    ...
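At runtime, these Scope subclasses simply wrap the underlying built-in type, so values constructed from them can be used like ordinary str, dict, or list objects. A brief sketch of this assumed runtime behavior:

# Sketch: a parametrized Scope subclass behaves like its builtin base at runtime,
# so regular str operations work on the constructed value.
fname = Filename[Sample]('file102.txt')
print(isinstance(fname, str))   # True, since Filename also derives from str
print(fname.endswith('.txt'))   # True, ordinary str methods are available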
We can get or compute the result as usual:
[4]:
graph = pipeline.get(Result)
print(f'Result={graph.compute()}')
graph.visualize()
Result=10.8
[4]:
In this case, we could have achieved something similar to the above computation graph using the Parameter Tables feature. In the next section we will go through the advantages of using generic providers.
Advantages of using generic providers#
Computing intermediate results#
Generic domain types with named scopes make it simple to request computation of intermediate results with a clear notation:
[5]:
pipeline.compute(CleanedData[Sample])
[5]:
[1, 2, 3]
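The same notation works for any concrete run type, so the cleaned background is available as an intermediate result as well. A small sketch:

# Sketch: the background intermediate is requested with the same notation.
pipeline.compute(CleanedData[Background])  # [0.1, 0.1] for 'background.txt'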
Specialized providers#
We may wish to specialize a provider for specific values of a generic’s type parameters. For example, we may need to use distinct cleaning functions for Sample and Background. We can do so simply by defining a specialized provider for each type:
[6]:
def clean_data(raw_data: RawData[Sample]) -> CleanedData[Sample]:
    """Clean the sample data, removing NaNs."""
    import math

    return CleanedData[Sample]([x for x in raw_data['data'] if not math.isnan(x)])


def clean_background(raw_data: RawData[Background]) -> CleanedData[Background]:
    """Clean the background, removing negative values."""
    return CleanedData[Background]([x for x in raw_data['data'] if not x < 0])


providers = [load, clean_data, clean_background, process, subtract_background]
pipeline = sciline.Pipeline(providers, params=params)
pipeline.visualize(Result)
[6]:
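With the specialized providers in place, each concrete type is cleaned by its own provider: the sample still has NaNs removed, while the background is filtered for negative values. A small sketch, assuming the pipeline defined in the previous cell:

# Sketch: Sciline picks the specialized provider matching each concrete type.
print(pipeline.compute(CleanedData[Sample]))      # NaNs removed by clean_data
print(pipeline.compute(CleanedData[Background]))  # negatives removed by clean_background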
Generic providers and map operations#
As a more complex example of where generic providers are useful, we can add a map operation so that we can process multiple samples:
[7]:
run_ids = [102, 103, 104, 105]
filenames = [f'file{i}.txt' for i in run_ids]
params = {
    ScaleFactor: 2.0,
    Filename[Background]: 'background.txt',
}
pipeline = sciline.Pipeline(providers, params=params)
pipeline = pipeline.map({Filename[Sample]: filenames})
# We can collect the results into a list for simplicity in this example
graph = pipeline.reduce(func=lambda *x: list(x), name='collected').get('collected')
graph.visualize()
[7]:
[8]:
graph.compute()
[8]:
[10.8, 18.4, 28.0, 10.8]
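The reduce step is not limited to collecting results into a list; any function accepting the mapped results can be used. A minimal sketch that instead sums the per-sample results (the node name 'summed' is an arbitrary label chosen for illustration):

# Sketch: reduce with a different function, summing the per-sample results.
total = pipeline.reduce(func=lambda *x: sum(x), name='summed').get('summed').compute()
print(total)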