Parameter Tables#

Overview#

Sciline supports a mechanism for repeating part or all of a computation with different values for one or more parameters. This enables a variety of use cases similar to map, reduce, and groupby operations in other systems. We illustrate each of these in the following three chapters. Sciline’s implementation is based on Cyclebane.

Computing results for series of parameters#

This chapter illustrates how to perform map operations with Sciline.

Starting with the model workflow introduced in Getting Started, we would like to replace the fixed Filename parameter with a series of filenames listed in a “parameter table”. We begin by defining the base pipeline:

[1]:
from typing import NewType
import sciline

_fake_filesystem = {
    'file102.txt': [1, 2, float('nan'), 3],
    'file103.txt': [1, 2, 3, 4],
    'file104.txt': [1, 2, 3, 4, 5],
    'file105.txt': [1, 2, 3],
}

# 1. Define domain types

Filename = NewType('Filename', str)
RawData = NewType('RawData', dict)
CleanedData = NewType('CleanedData', list)
ScaleFactor = NewType('ScaleFactor', float)
Result = NewType('Result', float)


# 2. Define providers


def load(filename: Filename) -> RawData:
    """Load the data from the filename."""
    data = _fake_filesystem[filename]
    return RawData({'data': data, 'meta': {'filename': filename}})


def clean(raw_data: RawData) -> CleanedData:
    """Clean the data, removing NaNs."""
    import math

    return CleanedData([x for x in raw_data['data'] if not math.isnan(x)])


def process(data: CleanedData, param: ScaleFactor) -> Result:
    """Process the data, multiplying the sum by the scale factor."""
    return Result(sum(data) * param)


# 3. Create pipeline

providers = [load, clean, process]
params = {ScaleFactor: 2.0}
base = sciline.Pipeline(providers, params=params)

Aside from not having defined a value for the Filename parameter, this is identical to the example in Getting Started. The task-graph visualization indicates this missing parameter:

[2]:
base.visualize(Result, graph_attr={'rankdir': 'LR'})
[2]:
../_images/user-guide_parameter-tables_4_0.svg

We now define a “parameter table” listing the filenames we would like to process:

[3]:
import pandas as pd

run_ids = [102, 103, 104, 105]
filenames = [f'file{i}.txt' for i in run_ids]
param_table = pd.DataFrame({Filename: filenames}, index=run_ids).rename_axis(
    index='run_id'
)
param_table
[3]:
__main__.Filename
run_id
102 file102.txt
103 file103.txt
104 file104.txt
105 file105.txt

Note how we used a node name of the pipeline as the column name in the parameter table. For convenience we used a pandas.DataFrame to represent the table above, but the use of Pandas is entirely optional. Equivalently the table could be represented as a dict, where each key corresponds to a column header and each value is a list of values for that column, i.e., {Filename: filenames}. Specifying an index is currently not possible in this case, and it will default to a range index.

We can now use Pipeline.map to create a modified pipeline that processes each row in the parameter table:

[4]:
pipeline = base.map(param_table)

We can use the compute_mapped function to compute Result for each index in the parameter table:

[5]:
results = sciline.compute_mapped(pipeline, Result)
pd.DataFrame(results)  # DataFrame for HTML rendering
[5]:
__main__.Result
run_id
102 12.0
103 20.0
104 30.0
105 12.0

Note the use of the run_id index. If the index axis of the DataFrame has no name then a default of dim_0, dim_1, etc. is used.

Note

compute_mapped depends on Pandas, which is not a dependency of Sciline and must be installed separately, e.g., using pip:

pip install pandas

We can also visualize the task graph for computing the series of Result values. For this, we need to get all the node names derived from Result via the map operation. The get_mapped_node_names function can be used to get a pandas.Series of these node names, which we can then visualize:

[6]:
targets = sciline.get_mapped_node_names(pipeline, Result)
pipeline.visualize(targets)
[6]:
../_images/user-guide_parameter-tables_12_0.svg

Nodes that depend on values from a parameter table are drawn with the parameter index name (the row dimension of the parameter table) and index value (defaulting to a range index starting at 0 if no index is given) shown in parentheses.

Note

With long parameter tables, graphs can get messy and hard to read. Try using visualize(..., compact=True).

The compact=True option yields a much more compact representation. Instead of drawing every intermediate result and provider for each parameter, each parameter-dependent result is then represented as a single “3D box” node, standing for all nodes for the different values of the respective parameter.

Combining intermediate results from series of parameters#

This chapter illustrates how to implement reduce operations with Sciline.

Instead of requesting a series of results as above, we use the Pipeline.reduce method and pass a function that combines the results from each parameter into a single result:

[7]:
graph = pipeline.reduce(func=lambda *result: sum(result), name='merged').get('merged')
graph.visualize()
[7]:
../_images/user-guide_parameter-tables_15_0.svg

Note

The func passed to reduce is not making use of Sciline’s mechanism of assembling a graph based on type hints. In particular, the input type may be identical to the output type. The Pipeline.reduce method adds a new node, attached at a unique (but mapped) sink node of the graph. Pipeline.__getitem__ and Pipeline.__setitem__ can be used to compose more complex graphs where the reduction is not the final step.
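To make the note concrete, the func is an ordinary callable that receives all mapped results as positional arguments. With the per-file Result values computed earlier in this example, it behaves like this pure-Python sketch:

```python
# A plain Python reduction function, equivalent to the lambda used above.
def merged(*result: float) -> float:
    return sum(result)

# Per-file Result values from this example (runs 102-105).
assert merged(12.0, 20.0, 30.0, 12.0) == 74.0
```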

Note that the graph shown above is identical to the example in the previous section, except for the last two nodes in the graph. The computation now returns a single result:

[8]:
graph.compute()
[8]:
74.0

This is useful if we need to continue computation after gathering results without setting up a second pipeline.

Note

For the reduce operation, all inputs to the reduction function have to be kept in memory simultaneously. This can be very memory intensive. We intend to support, e.g., hierarchical reduction operations in the future, where intermediate results are combined in a tree-like fashion to avoid excessive memory consumption.
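A hierarchical reduction of the kind alluded to would combine intermediate results pairwise, so that only one level of partial results is held at a time. This is not a Sciline API, just a pure-Python sketch of the idea:

```python
def tree_reduce(values: list[float], combine) -> float:
    """Combine values pairwise in a tree, level by level, so only the
    partial results of the current level are held, rather than all inputs."""
    while len(values) > 1:
        # Combine adjacent pairs; a leftover odd element is carried upward.
        values = [
            combine(values[i], values[i + 1]) if i + 1 < len(values) else values[i]
            for i in range(0, len(values), 2)
        ]
    return values[0]

# Same total as the flat reduction over this example's per-file results.
assert tree_reduce([12.0, 20.0, 30.0, 12.0], lambda a, b: a + b) == 74.0
```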

Grouping intermediate results based on secondary parameters#

Cyclebane and Sciline do not support ``groupby`` yet. This is work in progress, so the example below is not yet functional.

This chapter illustrates how to implement groupby operations with Sciline.

Continuing from the examples for map and reduce, we can introduce a secondary parameter in the table, such as the material of the sample:

[9]:
Material = NewType('Material', str)

run_ids = [102, 103, 104, 105]
sample = ['diamond', 'graphite', 'graphite', 'graphite']
filenames = [f'file{i}.txt' for i in run_ids]
param_table = pd.DataFrame(
    {Filename: filenames, Material: sample}, index=run_ids
).rename_axis(index='run_id')
param_table
[9]:
__main__.Filename __main__.Material
run_id
102 file102.txt diamond
103 file103.txt graphite
104 file104.txt graphite
105 file105.txt graphite

Future releases of Sciline will support a groupby operation, roughly as follows:

pipeline = base.map(param_table).groupby(Material).reduce(func=merge)

This would compute the merged result, grouped by the value of Material. Note how the initial steps of the computation would depend on the run_id index name, while later steps would depend on Material, a new index name defined by the groupby operation. The files for each run ID would be grouped by their material and then merged.

More examples#

Combining multiple parameters from same table#

[10]:
import sciline as sl

Sum = NewType("Sum", float)
Param1 = NewType("Param1", int)
Param2 = NewType("Param2", int)


def gather(*x: float) -> Sum:
    return Sum(sum(x))


def ratio(x: Param1, y: Param2) -> float:
    return x / y


params = pd.DataFrame({Param1: [1, 4, 9], Param2: [1, 2, 3]})
pl = sl.Pipeline([ratio])
pl = pl.map(params).reduce(func=gather, name=Sum)

pl.visualize(Sum)
[10]:
../_images/user-guide_parameter-tables_26_0.svg
[11]:
pl.compute(Sum)
[11]:
np.float64(6.0)
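The 6.0 arises from dividing the two parameters row by row and summing: 1/1 + 4/2 + 9/3. A pure-Python equivalent of the mapped computation:

```python
# Row-wise values from the parameter table: Param1 / Param2 per row, summed.
rows = [(1, 1), (4, 2), (9, 3)]
total = sum(x / y for x, y in rows)
assert total == 6.0  # 1/1 + 4/2 + 9/3
```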

Diamond graphs#

[12]:
Sum = NewType("Sum", float)
Param = NewType("Param", int)
Param1 = NewType("Param1", int)
Param2 = NewType("Param2", int)


def gather(*x: float) -> float:
    return sum(x)


def to_param1(x: Param) -> Param1:
    return Param1(x)


def to_param2(x: Param) -> Param2:
    return Param2(x)


def product(x: Param1, y: Param2) -> float:
    return x * y


pl = sl.Pipeline([product, to_param1, to_param2])
params = pd.DataFrame({Param: [1, 2, 3]})
pl = pl.map(params).reduce(func=gather, name=Sum)
pl.visualize(Sum)
[12]:
../_images/user-guide_parameter-tables_29_0.svg

Combining parameters from different tables#

[13]:
from typing import Any
import sciline as sl

Param1 = NewType("Param1", int)
Param2 = NewType("Param2", int)


def gather(*x: Any) -> list[Any]:
    return list(x)


def product(x: Param1, y: Param2) -> float:
    return x * y


base = sl.Pipeline([product])
pl = (
    base.map({Param1: [1, 4, 9]})
    .map({Param2: [1, 2]})
    .reduce(func=gather, name='reduce_1', index='dim_1')
    .reduce(func=gather, name='reduce_0')
)

pl.visualize('reduce_0')
[13]:
../_images/user-guide_parameter-tables_31_0.svg

Note how intermediates such as float(dim_1, dim_0) depend on two parameters, i.e., we are dealing with a 2-D array of branches in the graph.

[14]:
pl.compute('reduce_0')
[14]:
[[1, 2], [4, 8], [9, 18]]
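The nested structure of the two reductions mirrors nested loops over the two parameter axes; the inner list corresponds to the reduction over dim_1 (Param2) and the outer to dim_0 (Param1). A pure-Python sketch of the same computation:

```python
# 2-D array of branches: one row per Param1 value, one column per Param2 value.
param1 = [1, 4, 9]
param2 = [1, 2]
nested = [[x * y for y in param2] for x in param1]
assert nested == [[1, 2], [4, 8], [9, 18]]
```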

It is also possible to reduce multiple axes at once. For example, reduce will reduce all axes if no index or axis is specified:

[15]:
pl = (
    base.map({Param1: [1, 4, 9]})
    .map({Param2: [1, 2]})
    .reduce(func=gather, name='reduce_both')
)
pl.visualize('reduce_both')
[15]:
../_images/user-guide_parameter-tables_35_0.svg