Generic Providers#
Overview#
Sometimes we may want to replicate parts of a workflow and apply it to a different input. In Parameter Tables we saw how we can use a parameter table for this purpose. Another approach is to use generics to define a generic provider.
While parameter tables are well suited for “homogeneous” tasks such as a set of images, generics can be a better choice for “heterogeneous” tasks where the computation steps are (nearly) identical, but the purpose of the computation is different. For example, we may have a “data” image and a “background” image (from a sensor background measurement). We want to perform some initial homogeneous operations on both, but ultimately reach a point where we want to subtract the background from the data.
Generally speaking, using Sciline with generics can have several advantages:
Intuitive syntax for requesting computation of intermediate results.
Specialized providers can be used for individual intermediate steps, whereas with parameter tables the providers are necessarily identical for every row.
Maintain reusability of providers and avoid synchronization points in workflows by combining parameter tables and generic providers.
Before moving on to the next sections, where we will elaborate on these points, consider how generic providers can be used directly in a pipeline. For example, we can set up a pipeline that computes a list of any type (provided that there is a parameter of that type) in a very compact manner:
[1]:
from typing import TypeVar

import sciline

T = TypeVar("T", int, float, str)


def duplicate(x: T) -> list[T]:
    """A generic provider that can make any list."""
    return [x, x]


pipeline = sciline.Pipeline([duplicate], params={int: 1, float: 2.0, str: "3"})
print(pipeline.compute(list[int]))
print(pipeline.compute(list[float]))
print(pipeline.compute(list[str]))
[1, 1]
[2.0, 2.0]
['3', '3']
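Because T is constrained to int, float, and str, a request for a list of any other type cannot be satisfied by this pipeline: there is neither a provider nor a parameter for it. The following is a minimal sketch of this; the exact exception class raised by Sciline is not shown in this example, so we catch broadly for illustration:

# Sketch: requesting a type outside the TypeVar constraints fails, because no
# provider or parameter can produce it. The precise exception class may vary
# between Sciline versions, so we catch broadly here.
try:
    pipeline.compute(list[bytes])
except Exception as error:
    print(f'Cannot compute list[bytes]: {error!r}')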
Counterexample: Naive approach#
Starting with the model workflow introduced in Getting Started, consider an extension where we also need to subtract a background signal from the data. Naively, we could extend the example as follows, which is verbose and error-prone due to the duplication:
[2]:
from typing import NewType

import sciline

_fake_filesystem = {
    'file102.txt': [1, 2, float('nan'), 3],
    'file103.txt': [1, 2, 3, 4],
    'file104.txt': [1, 2, 3, 4, 5],
    'file105.txt': [1, 2, 3],
    'background.txt': [0.1, 0.1],
}

# 1. Define domain types
Filename = NewType('Filename', str)
BackgroundFilename = NewType('BackgroundFilename', str)
RawData = NewType('RawData', dict)
RawBackground = NewType('RawBackground', dict)
CleanedData = NewType('CleanedData', list)
CleanedBackground = NewType('CleanedBackground', list)
ScaleFactor = NewType('ScaleFactor', float)
Background = NewType('Background', list)
BackgroundSubtractedData = NewType('BackgroundSubtractedData', list)
Result = NewType('Result', float)


# 2. Define providers
def _load(filename: str) -> dict:
    data = _fake_filesystem[filename]
    return {'data': data, 'meta': {'filename': filename}}


def load(filename: Filename) -> RawData:
    """Load the data from the filename."""
    return RawData(_load(filename))


def load_background(filename: BackgroundFilename) -> RawBackground:
    """Load the background from the filename."""
    return RawBackground(_load(filename))


def _clean(raw_data: dict) -> list:
    import math

    return [x for x in raw_data['data'] if not math.isnan(x)]


def clean(raw_data: RawData) -> CleanedData:
    """Clean the data, removing NaNs."""
    return CleanedData(_clean(raw_data))


def clean_background(raw_data: RawBackground) -> CleanedBackground:
    """Clean the background, removing NaNs."""
    return CleanedBackground(_clean(raw_data))


def subtract_background(
    data: CleanedData, background: CleanedBackground
) -> BackgroundSubtractedData:
    """Subtract the summed background from each data point."""
    return BackgroundSubtractedData([x - sum(background) for x in data])


def process(data: BackgroundSubtractedData, param: ScaleFactor) -> Result:
    """Process the data, multiplying the sum by the scale factor."""
    return Result(sum(data) * param)


# 3. Create pipeline
providers = [
    load,
    load_background,
    clean,
    clean_background,
    process,
    subtract_background,
]
params = {
    ScaleFactor: 2.0,
    Filename: 'file102.txt',
    BackgroundFilename: 'background.txt',
}
pipeline = sciline.Pipeline(providers, params=params)

print(f'Result={pipeline.compute(Result)}')
pipeline.visualize(Result)
Result=10.8
[2]:
We would like to reuse the load and clean functions for the background, without having to add wrappers and without duplicating all the involved domain types (BackgroundFilename, RawBackground, and CleanedBackground). However, we cannot do this directly, since Filename, RawData, and CleanedData are unique identifiers specific to the non-background files, and this uniqueness forms the foundation of how Sciline works.
Sciline seeks to address this conundrum by providing a mechanism for using generic providers and for defining generic type aliases, introduced in the next section.
Generic domain types and providers#
To avoid duplicates of domain types and providers, we may instead define generic domain types and generic providers. The example is then written as:
[3]:
from typing import NewType, TypeVar

import sciline

_fake_filesystem = {
    'file102.txt': [1, 2, float('nan'), 3],
    'file103.txt': [1, 2, 3, 4],
    'file104.txt': [1, 2, 3, 4, 5],
    'file105.txt': [1, 2, 3],
    'background.txt': [0.1, 0.1],
}

# 1. Define domain types
# 1.a Define concrete RunType values we will use
Sample = NewType('Sample', int)
Background = NewType('Background', int)

# 1.b Define generic domain types
RunType = TypeVar('RunType', Sample, Background)


class Filename(sciline.Scope[RunType, str], str): ...


class RawData(sciline.Scope[RunType, dict], dict): ...


class CleanedData(sciline.Scope[RunType, list], list): ...


# 1.c Define normal domain types
ScaleFactor = NewType('ScaleFactor', float)
BackgroundSubtractedData = NewType('BackgroundSubtractedData', list)
Result = NewType('Result', float)


# 2. Define providers
# 2.a Define generic providers
def load(filename: Filename[RunType]) -> RawData[RunType]:
    """Load the data from the filename."""
    data = _fake_filesystem[filename]
    return RawData[RunType]({'data': data, 'meta': {'filename': filename}})


def clean(raw_data: RawData[RunType]) -> CleanedData[RunType]:
    """Clean the data, removing NaNs."""
    import math

    return CleanedData[RunType]([x for x in raw_data['data'] if not math.isnan(x)])


# 2.b Define normal providers
def subtract_background(
    data: CleanedData[Sample], background: CleanedData[Background]
) -> BackgroundSubtractedData:
    """Subtract the summed background from each data point."""
    return BackgroundSubtractedData([x - sum(background) for x in data])


def process(data: BackgroundSubtractedData, param: ScaleFactor) -> Result:
    """Process the data, multiplying the sum by the scale factor."""
    return Result(sum(data) * param)


# 3. Create pipeline
providers = [load, clean, process, subtract_background]
params = {
    ScaleFactor: 2.0,
    Filename[Sample]: 'file102.txt',
    Filename[Background]: 'background.txt',
}
pipeline = sciline.Pipeline(providers, params=params)
Apart from updated type annotations for load and clean, the code is nearly identical to the original example without background subtraction.
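Since the single generic load provider now serves both branches of the graph, we can compute the raw data for either run type from the same provider. A minimal sketch using the pipeline defined above:

# Sketch: one generic provider handles both the Sample and Background branches.
raw_sample = pipeline.compute(RawData[Sample])
raw_background = pipeline.compute(RawData[Background])
print(raw_sample['meta'])      # {'filename': 'file102.txt'}
print(raw_background['meta'])  # {'filename': 'background.txt'}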
Note
Sciline requires type variables that are used as part of keys to have constraints. In the above example, we define a type variable RunType that is constrained to be either Sample or Background:
RunType = TypeVar('RunType', Sample, Background)
Note
We use a peculiar-looking syntax for defining “generic type aliases”. We would love to use typing.NewType for this, but it does not allow for definition of generic aliases. The syntax we use (subclassing sciline.Scope) is a workaround for defining generic aliases that work both at runtime and with mypy:
class Filename(sciline.Scope[RunType, str], str):
    ...
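At runtime, these Scope subclasses simply wrap the underlying built-in type, so values constructed from them can be used like ordinary str, dict, or list objects. A brief sketch of this assumed runtime behavior:

# Sketch: a parametrized Scope subclass behaves like its builtin base at runtime,
# so regular str operations work on the constructed value.
fname = Filename[Sample]('file102.txt')
print(isinstance(fname, str))   # True, since Filename also derives from str
print(fname.endswith('.txt'))   # True, ordinary str methods are available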
We can get or compute the result as usual:
[4]:
graph = pipeline.get(Result)
print(f'Result={graph.compute()}')
graph.visualize()
Result=10.8
[4]:
In this case, we could have achieved something similar to the above computation graph using the Parameter Tables feature. In the next section we will go through the advantages of using generic providers.
Advantages of using generic providers#
Computing intermediate results#
Generic domain types with named scopes make it simple to request computation of intermediate results with a clear notation:
[5]:
pipeline.compute(CleanedData[Sample])
[5]:
[1, 2, 3]
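The same notation works for any concrete run type, so the cleaned background is available as an intermediate result as well. A small sketch:

# Sketch: the background intermediate is requested with the same notation.
pipeline.compute(CleanedData[Background])  # [0.1, 0.1] for 'background.txt'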
Specialized providers#
We may wish to specialize a provider for specific values of a generic’s type parameters. For example, we may need to use distinct cleaning functions for Sample and Background. We can do so simply by defining a specialized provider for each type:
[6]:
def clean_data(raw_data: RawData[Sample]) -> CleanedData[Sample]:
    """Clean the sample data, removing NaNs."""
    import math

    return CleanedData[Sample]([x for x in raw_data['data'] if not math.isnan(x)])


def clean_background(raw_data: RawData[Background]) -> CleanedData[Background]:
    """Clean the background, removing negative values."""
    return CleanedData[Background]([x for x in raw_data['data'] if not x < 0])


providers = [load, clean_data, clean_background, process, subtract_background]
pipeline = sciline.Pipeline(providers, params=params)
pipeline.visualize(Result)
[6]:
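With the specialized providers in place, each concrete type is cleaned by its own provider: the sample still has NaNs removed, while the background is filtered for negative values. A small sketch, assuming the pipeline defined in the previous cell:

# Sketch: Sciline picks the specialized provider matching each concrete type.
print(pipeline.compute(CleanedData[Sample]))      # NaNs removed by clean_data
print(pipeline.compute(CleanedData[Background]))  # negatives removed by clean_background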
Generic providers and map operations#
As a more complex example of where generic providers are useful, we can add a map operation so that we can process multiple samples:
[7]:
run_ids = [102, 103, 104, 105]
filenames = [f'file{i}.txt' for i in run_ids]
params = {
    ScaleFactor: 2.0,
    Filename[Background]: 'background.txt',
}
pipeline = sciline.Pipeline(providers, params=params)
pipeline = pipeline.map({Filename[Sample]: filenames})
# We can collect the results into a list for simplicity in this example
graph = pipeline.reduce(func=lambda *x: list(x), name='collected').get('collected')
graph.visualize()
[7]:
[8]:
graph.compute()
[8]:
[10.8, 18.4, 28.0, 10.8]
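The reduce step is not limited to collecting results into a list; any function accepting the mapped results can be used. A minimal sketch that instead sums the per-sample results (the node name 'summed' is an arbitrary label chosen for illustration):

# Sketch: reduce with a different function, summing the per-sample results.
total = pipeline.reduce(func=lambda *x: sum(x), name='summed').get('summed').compute()
print(total)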