Parameter Tables#
Overview#
Sciline supports a mechanism for repeating parts of or all of a computation with different values for one or more parameters. This enables a variety of use cases similar to map, reduce, and groupby operations in other systems. We illustrate each of these in the following three chapters. Sciline’s implementation is based on Cyclebane.
Computing results for series of parameters#
This chapter illustrates how to perform map operations with Sciline.
Starting with the model workflow introduced in Getting Started, we would like to replace the fixed Filename parameter with a series of filenames listed in a “parameter table”. We begin by defining the base pipeline:
[1]:
from typing import NewType

import sciline

_fake_filesystem = {
    'file102.txt': [1, 2, float('nan'), 3],
    'file103.txt': [1, 2, 3, 4],
    'file104.txt': [1, 2, 3, 4, 5],
    'file105.txt': [1, 2, 3],
}

# 1. Define domain types
Filename = NewType('Filename', str)
RawData = NewType('RawData', dict)
CleanedData = NewType('CleanedData', list)
ScaleFactor = NewType('ScaleFactor', float)
Result = NewType('Result', float)


# 2. Define providers
def load(filename: Filename) -> RawData:
    """Load the data from the filename."""
    data = _fake_filesystem[filename]
    return RawData({'data': data, 'meta': {'filename': filename}})


def clean(raw_data: RawData) -> CleanedData:
    """Clean the data, removing NaNs."""
    import math

    return CleanedData([x for x in raw_data['data'] if not math.isnan(x)])


def process(data: CleanedData, param: ScaleFactor) -> Result:
    """Process the data, multiplying the sum by the scale factor."""
    return Result(sum(data) * param)


# 3. Create pipeline
providers = [load, clean, process]
params = {ScaleFactor: 2.0}
base = sciline.Pipeline(providers, params=params)
Aside from not having defined a value for the Filename parameter, this is identical to the example in Getting Started. The task-graph visualization indicates this missing parameter:
[2]:
base.visualize(Result, graph_attr={'rankdir': 'LR'})
[2]:
We now define a “parameter table” listing the filenames we would like to process:
[3]:
import pandas as pd
run_ids = [102, 103, 104, 105]
filenames = [f'file{i}.txt' for i in run_ids]
param_table = pd.DataFrame({Filename: filenames}, index=run_ids).rename_axis(
    index='run_id'
)
param_table
[3]:
| run_id | __main__.Filename |
|---|---|
| 102 | file102.txt |
| 103 | file103.txt |
| 104 | file104.txt |
| 105 | file105.txt |
Note how we used a node name of the pipeline as the column name in the parameter table. For convenience we used a pandas.DataFrame to represent the table above, but the use of Pandas is entirely optional. Equivalently, the table could be represented as a dict, where each key corresponds to a column header and each value is a list of values for that column, i.e., {Filename: filenames}. Specifying an index is currently not possible in this case, so the index defaults to a range index.
We can now use Pipeline.map to create a modified pipeline that processes each row in the parameter table:
[4]:
pipeline = base.map(param_table)
We can use the compute_mapped function to compute Result for each index in the parameter table:
[5]:
results = sciline.compute_mapped(pipeline, Result)
pd.DataFrame(results) # DataFrame for HTML rendering
[5]:
| run_id | __main__.Result |
|---|---|
| 102 | 12.0 |
| 103 | 20.0 |
| 104 | 30.0 |
| 105 | 12.0 |
Note the use of the run_id index. If the index axis of the DataFrame has no name, a default of dim_0, dim_1, etc. is used.
Note
compute_mapped depends on Pandas, which is not a dependency of Sciline and must be installed separately, e.g., using pip:
pip install pandas
We can also visualize the task graph for computing the series of Result values. For this, we need to get all the node names derived from Result via the map operation. The get_mapped_node_names function can be used to get a pandas.Series of these node names, which we can then visualize:
[6]:
targets = sciline.get_mapped_node_names(pipeline, Result)
pipeline.visualize(targets)
[6]:
Nodes that depend on values from a parameter table are drawn with the parameter index name (the row dimension of the parameter table) and index value (defaulting to a range index starting at 0 if no index is given) shown in parentheses.
Note
With long parameter tables, graphs can get messy and hard to read. Try using visualize(..., compact=True).
The compact=True option yields a much more compact representation: instead of drawing every intermediate result and provider for each parameter, each parameter-dependent result is represented as a single “3D box” node, standing for all nodes for the different values of the respective parameter.
Combining intermediate results from series of parameters#
This chapter illustrates how to implement reduce operations with Sciline.
Instead of requesting a series of results as above, we use the Pipeline.reduce method and pass a function that combines the results from each parameter into a single result:
[7]:
graph = pipeline.reduce(func=lambda *result: sum(result), name='merged').get('merged')
graph.visualize()
[7]:
Note
The func passed to reduce is not making use of Sciline’s mechanism of assembling a graph based on type hints. In particular, the input type may be identical to the output type. The Pipeline.reduce method adds a new node, attached at a unique (but mapped) sink node of the graph.
Pipeline.__getitem__ and Pipeline.__setitem__ can be used to compose more complex graphs where the reduction is not the final step.
Note that the graph shown above is identical to the example in the previous section, except for the last two nodes in the graph. The computation now returns a single result:
[8]:
graph.compute()
[8]:
74.0
This is useful if we need to continue computation after gathering results without setting up a second pipeline.
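As a sanity check, the same reduction can be reproduced in plain Python (outside of Sciline), mirroring the fake file system and scale factor defined at the start of this chapter:

```python
import math

# Mirror of the fake file system and scale factor defined above.
fake_filesystem = {
    'file102.txt': [1, 2, float('nan'), 3],
    'file103.txt': [1, 2, 3, 4],
    'file104.txt': [1, 2, 3, 4, 5],
    'file105.txt': [1, 2, 3],
}
scale_factor = 2.0

# Apply clean (drop NaNs) and process (scaled sum) to every file, then
# sum the per-file results, as the reduce node does.
results = [
    sum(x for x in data if not math.isnan(x)) * scale_factor
    for data in fake_filesystem.values()
]
total = sum(results)
print(total)  # 74.0
```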
Note
For the reduce operation, all inputs to the reduction function have to be kept in memory simultaneously. This can be very memory intensive. We intend to support, e.g., hierarchical reduction operations in the future, where intermediate results are combined in a tree-like fashion to avoid excessive memory consumption.
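To illustrate the idea (a plain-Python sketch, not a Sciline API), a hierarchical reduction combines intermediate results pairwise, level by level, so only one level of intermediates needs to be alive at a time:

```python
def tree_reduce(values, combine):
    """Combine values pairwise in a tree: each pass halves the number of
    intermediates, instead of holding all inputs at once."""
    items = list(values)
    while len(items) > 1:
        items = [
            combine(items[i], items[i + 1]) if i + 1 < len(items) else items[i]
            for i in range(0, len(items), 2)
        ]
    return items[0]


# Same per-file results as above, combined tree-wise instead of flat.
print(tree_reduce([12.0, 20.0, 30.0, 12.0], lambda a, b: a + b))  # 74.0
```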
Grouping intermediate results based on secondary parameters#
Cyclebane and Sciline do not support ``groupby`` yet; this is work in progress, so the example below is not functional.
This chapter illustrates how to implement groupby operations with Sciline.
Continuing from the examples for map and reduce, we can introduce a secondary parameter in the table, such as the material of the sample:
[9]:
Material = NewType('Material', str)
run_ids = [102, 103, 104, 105]
sample = ['diamond', 'graphite', 'graphite', 'graphite']
filenames = [f'file{i}.txt' for i in run_ids]
param_table = pd.DataFrame(
    {Filename: filenames, Material: sample}, index=run_ids
).rename_axis(index='run_id')
param_table
[9]:
| run_id | __main__.Filename | __main__.Material |
|---|---|---|
| 102 | file102.txt | diamond |
| 103 | file103.txt | graphite |
| 104 | file104.txt | graphite |
| 105 | file105.txt | graphite |
Future releases of Sciline will support a groupby operation, roughly as follows:
pipeline = base.map(param_table).groupby(Material).reduce(func=merge)
We can then compute the merged result, grouped by the value of Material. Note how the initial steps of the computation depend on the run_id index name, while later steps depend on Material, a new index name defined by the groupby operation. The files for each run ID have been grouped by their material and then merged.
More examples#
Combining multiple parameters from same table#
[10]:
import sciline as sl

Sum = NewType("Sum", float)
Param1 = NewType("Param1", int)
Param2 = NewType("Param2", int)


def gather(*x: float) -> Sum:
    return Sum(sum(x))


def ratio(x: Param1, y: Param2) -> float:
    return x / y


params = pd.DataFrame({Param1: [1, 4, 9], Param2: [1, 2, 3]})
pl = sl.Pipeline([ratio])
pl = pl.map(params).reduce(func=gather, name=Sum)
pl.visualize(Sum)
[10]:
[11]:
pl.compute(Sum)
[11]:
np.float64(6.0)
Diamond graphs#
[12]:
Sum = NewType("Sum", float)
Param = NewType("Param", int)
Param1 = NewType("Param1", int)
Param2 = NewType("Param2", int)


def gather(*x: float) -> float:
    return sum(x)


def to_param1(x: Param) -> Param1:
    return Param1(x)


def to_param2(x: Param) -> Param2:
    return Param2(x)


def product(x: Param1, y: Param2) -> float:
    return x * y


pl = sl.Pipeline([product, to_param1, to_param2])
params = pd.DataFrame({Param: [1, 2, 3]})
pl = pl.map(params).reduce(func=gather, name=Sum)
pl.visualize(Sum)
[12]:
Combining parameters from different tables#
[13]:
from typing import Any

import sciline as sl

Param1 = NewType("Param1", int)
Param2 = NewType("Param2", int)


def gather(*x: Any) -> list[Any]:
    return list(x)


def product(x: Param1, y: Param2) -> float:
    return x * y


base = sl.Pipeline([product])
pl = (
    base.map({Param1: [1, 4, 9]})
    .map({Param2: [1, 2]})
    .reduce(func=gather, name='reduce_1', index='dim_1')
    .reduce(func=gather, name='reduce_0')
)
pl.visualize('reduce_0')
[13]:
Note how intermediates such as float(dim_1, dim_0) depend on two parameters, i.e., we are dealing with a 2-D array of branches in the graph.
[14]:
pl.compute('reduce_0')
[14]:
[[1, 2], [4, 8], [9, 18]]
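The nesting mirrors the two map dimensions. In plain Python (outside of Sciline), the same 2-D structure and the two-stage reduction look like this:

```python
param1 = [1, 4, 9]  # dim_0
param2 = [1, 2]     # dim_1

# Each branch computes product(x, y); the inner lists correspond to the
# reduce over dim_1, the outer list to the final reduce over dim_0.
result = [[x * y for y in param2] for x in param1]
print(result)  # [[1, 2], [4, 8], [9, 18]]
```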
It is also possible to reduce multiple axes at once. For example, reduce will reduce all axes if no index or axis is specified:
[15]:
pl = (
    base.map({Param1: [1, 4, 9]})
    .map({Param2: [1, 2]})
    .reduce(func=gather, name='reduce_both')
)
pl.visualize('reduce_both')
[15]: