Provenance#
It is generally useful to be able to track the provenance of a dataset, that is, where it came from and how it was processed. Sciline can help because its task graphs encode the ‘how’. To this end, a graph needs to be stored together with the processed data.
Considering that such graphs might be stored for a long time, they need to be serialized to a format that
1. represents the full structure of the graph,
2. is readable by software that does not depend on Sciline or Python,
3. is human readable (with some effort; the priority is machine readability).
Points 2 and 3 exclude serializing the full Python objects, e.g., with pickle. But this means that any solution will be partial as it cannot capture the full environment that the pipeline is defined in. In particular, it cannot track functions called by providers that are external to the pipeline. See the section on Reproducibility.
Note that the Graphviz objects produced by Pipeline.visualize are not sufficient because they do not encode the full graph structure but are instead optimized to give an overview of a task graph.
Attention:
Sciline does not currently support serializing the values of parameters. This is the responsibility of the user, at least for now.
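Until such support exists, one option is to write the parameter values next to the serialized graph yourself. The following is a minimal sketch, not a Sciline feature: the helper `bundle_provenance` and the output layout (keys `graph` and `parameters`) are hypothetical, and it assumes that all parameter values are JSON-serializable. The `graph` dict is a hand-written stand-in for a serialized task graph as described in the next section.

```python
import json


def bundle_provenance(graph: dict, parameters: dict) -> str:
    """Combine a serialized graph with user-supplied parameter values.

    The layout (keys 'graph' and 'parameters') is a hypothetical choice,
    it is not defined by Sciline.
    """
    return json.dumps({"graph": graph, "parameters": parameters}, indent=2)


# Stand-in for a serialized task graph (structure only, no nodes).
graph = {"directed": True, "multigraph": False, "nodes": [], "edges": []}

# Parameter values keyed by the label of the corresponding node.
bundle = bundle_provenance(graph, {"Int[A]": 3})

# The bundle is plain JSON and can be read back without Sciline.
restored = json.loads(bundle)
```

Keying the values by node label keeps the file readable, but labels are shortened names. For long-term storage, the fully qualified type name may be the safer key.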
Serialization of task graphs to JSON#
Task graphs can be serialized to a simple JSON object that contains a node list and an edge list. This format is similar to other JSON graph formats used by, e.g., Networkx and JSON Graph Format.
First, define a helper to display JSON:
[1]:
from IPython import display
import json

def show_json(j: dict):
    return display.Markdown(f"""```json
{json.dumps(j, indent=2)}
```""")
Simple example#
First, construct a short pipeline, including some generic types and providers:
[2]:
from typing import NewType, TypeVar

import sciline

A = NewType('A', int)
B = NewType('B', int)
T = TypeVar('T', A, B)

class Int(sciline.Scope[T, int], int): ...

def make_int_b() -> Int[B]:
    return Int[B](2)

def to_string(a: Int[A], b: Int[B]) -> str:
    return f'a: {a}, b: {b}'

pipeline = sciline.Pipeline([make_int_b, to_string], params={Int[A]: 3})
task_graph = pipeline.get(str)
task_graph.visualize(graph_attr={'rankdir': 'LR'})
This graph can be serialized to JSON using its serialize method. We need to use the task graph obtained from pipeline.get for this purpose, not the pipeline itself:
[3]:
show_json(task_graph.serialize())
[3]:
{
  "directed": true,
  "multigraph": false,
  "nodes": [
    {
      "id": "1",
      "kind": "function",
      "label": "make_int_b",
      "function": "__main__.make_int_b",
      "args": [],
      "kwargs": {}
    },
    {
      "id": "2",
      "kind": "data",
      "label": "Int[B]",
      "type": "__main__.Int[__main__.B]"
    },
    {
      "id": "5",
      "kind": "function",
      "label": "to_string",
      "function": "__main__.to_string",
      "args": [
        "3",
        "6"
      ],
      "kwargs": {}
    },
    {
      "id": "8",
      "kind": "data",
      "label": "str",
      "type": "builtins.str"
    },
    {
      "id": "4",
      "kind": "data",
      "label": "Int[A]",
      "type": "__main__.Int[__main__.A]"
    }
  ],
  "edges": [
    {
      "id": "0",
      "source": "1",
      "target": "2"
    },
    {
      "id": "3",
      "source": "4",
      "target": "5"
    },
    {
      "id": "6",
      "source": "2",
      "target": "5"
    },
    {
      "id": "7",
      "source": "5",
      "target": "8"
    }
  ]
}
Let’s dissect the format.
The directed and multigraph properties always have the same values. They are included for compatibility with Networkx and JSON Graph Format.
Note the use of qualified names. Those make it easier to identify exactly what types and functions have been used while the label is a shortened representation.
All ids are unique across nodes and edges.
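A consumer of the format may want to verify these properties before relying on them. The following is a minimal sketch that uses only the Python standard library. The helper `check_ids` is hypothetical and not part of Sciline, and `graph` is an abbreviated, hand-copied version of the output above.

```python
# Abbreviated copy of the serialized graph shown above.
graph = {
    "nodes": [
        {"id": "1", "kind": "function", "label": "make_int_b"},
        {"id": "2", "kind": "data", "label": "Int[B]"},
        {"id": "5", "kind": "function", "label": "to_string"},
        {"id": "8", "kind": "data", "label": "str"},
        {"id": "4", "kind": "data", "label": "Int[A]"},
    ],
    "edges": [
        {"id": "0", "source": "1", "target": "2"},
        {"id": "3", "source": "4", "target": "5"},
        {"id": "6", "source": "2", "target": "5"},
        {"id": "7", "source": "5", "target": "8"},
    ],
}


def check_ids(graph: dict) -> None:
    """Raise ValueError if ids clash or an edge points to an unknown node."""
    node_ids = [node["id"] for node in graph["nodes"]]
    edge_ids = [edge["id"] for edge in graph["edges"]]
    all_ids = node_ids + edge_ids
    # Ids must be unique across nodes *and* edges.
    if len(set(all_ids)) != len(all_ids):
        raise ValueError("ids are not unique across nodes and edges")
    # Every edge must connect two existing nodes.
    known = set(node_ids)
    for edge in graph["edges"]:
        if edge["source"] not in known or edge["target"] not in known:
            raise ValueError(f"edge {edge['id']} refers to an unknown node")


check_ids(graph)  # passes silently for the example graph
```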
nodes#
An array of node objects. The nodes always have an id, label, and kind property.
- id is a unique identifier of the node. (Do not rely on it having a specific format; this may change at any time!)
- label is a human-readable name for the node.
- kind indicates what the node represents. In the example above, there are data nodes, which represent parameters and the results of providers, and function nodes, which correspond to providers in the pipeline.
Depending on their kind, nodes have additional properties. Data nodes have a type property which holds the fully qualified name of the type of the object that the node represents. For function nodes, there is a function property which stores the fully qualified name of the function that the provider uses. In addition, there are args and kwargs properties which list edge ids for all arguments and keyword arguments of the function.
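Because args holds edge ids rather than node ids, finding the inputs of a function takes two lookups: from the edge id to the edge, and from the edge's source to the node. A minimal sketch, using an abbreviated hand-copied version of the example graph; the helper `argument_labels` is hypothetical and not part of Sciline:

```python
# Abbreviated copy of the example graph; only the fields used here are kept.
nodes = [
    {"id": "1", "kind": "function", "label": "make_int_b", "args": []},
    {"id": "2", "kind": "data", "label": "Int[B]"},
    {"id": "5", "kind": "function", "label": "to_string", "args": ["3", "6"]},
    {"id": "8", "kind": "data", "label": "str"},
    {"id": "4", "kind": "data", "label": "Int[A]"},
]
edges = [
    {"id": "0", "source": "1", "target": "2"},
    {"id": "3", "source": "4", "target": "5"},
    {"id": "6", "source": "2", "target": "5"},
    {"id": "7", "source": "5", "target": "8"},
]


def argument_labels(function_id: str, nodes: list, edges: list) -> list[str]:
    """Return the labels of the nodes that feed a function's positional args."""
    nodes_by_id = {node["id"]: node for node in nodes}
    edges_by_id = {edge["id"]: edge for edge in edges}
    function_node = nodes_by_id[function_id]
    # Each entry in 'args' is an edge id; the edge's source is the input node.
    return [
        nodes_by_id[edges_by_id[edge_id]["source"]]["label"]
        for edge_id in function_node["args"]
    ]


print(argument_labels("5", nodes, edges))  # ['Int[A]', 'Int[B]']
```

The result matches the signature of to_string, which takes an Int[A] followed by an Int[B].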
edges#
An array of directed edges.
- id is a unique identifier of the edge.
- source and target refer to the id field of a node.
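Since the edge list fully describes the dependencies, software without Sciline can derive a valid execution order from it. The sketch below does this with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The edge list is copied from the example above, and the `labels` mapping is hand-written for display only.

```python
from graphlib import TopologicalSorter

# Edge list copied from the example graph.
edges = [
    {"id": "0", "source": "1", "target": "2"},
    {"id": "3", "source": "4", "target": "5"},
    {"id": "6", "source": "2", "target": "5"},
    {"id": "7", "source": "5", "target": "8"},
]
# Node labels from the example, keyed by node id (for display only).
labels = {"1": "make_int_b", "2": "Int[B]", "4": "Int[A]", "5": "to_string", "8": "str"}

# Map each node id to the set of node ids it directly depends on.
predecessors: dict[str, set[str]] = {}
for edge in edges:
    predecessors.setdefault(edge["target"], set()).add(edge["source"])

# static_order yields every node after all of its predecessors.
order = list(TopologicalSorter(predecessors).static_order())
print([labels[node_id] for node_id in order])
```

The order is not unique. For example, Int[A] may appear before or after make_int_b because neither depends on the other.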
Reproducibility#
The JSON format used here was chosen as a simple, future-proof format for task graphs. It can, however, only capture part of the actual pipeline. For example, it only shows the structure of the graph and contains the names of functions and types. But it does not encode the implementation of those functions or types. Thus, the graph can only be correctly reconstructed in an environment that contains the same software that was used to write the graph. This includes all packages that might be used by the providers.
Hint:
Note, for example, that the graphs here refer to functions and types in __main__, that is, functions and types defined in this Jupyter notebook. These cannot be reliably reconstructed. Thus, it is recommended to define all pipeline components in a Python package with a version number.
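When components do live in versioned packages, the versions can be recorded next to the graph. The following is a rough sketch and not a Sciline feature. The helpers `top_level_modules` and `package_versions` are hypothetical, and the lookup assumes that the installed distribution has the same name as the top-level module, which does not hold for every package. It also inspects only the leading module of each qualified name, so modules that appear only inside type parameters are missed.

```python
from importlib import metadata

# Abbreviated copy of the example graph; only qualified names are kept.
graph = {
    "nodes": [
        {"kind": "function", "function": "__main__.make_int_b"},
        {"kind": "data", "type": "__main__.Int[__main__.B]"},
        {"kind": "function", "function": "__main__.to_string"},
        {"kind": "data", "type": "builtins.str"},
    ]
}


def top_level_modules(graph: dict) -> set[str]:
    """Collect the leading module name of every function and type."""
    names = set()
    for node in graph["nodes"]:
        qualified = node.get("function") or node.get("type")
        if qualified:
            names.add(qualified.split(".", 1)[0])
    return names


def package_versions(graph: dict) -> dict:
    """Map module names to installed versions, or None if not found.

    Assumes distribution name == top-level module name.
    """
    versions = {}
    for name in sorted(top_level_modules(graph)):
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # E.g. '__main__' or standard-library modules have no distribution.
            versions[name] = None
    return versions


print(package_versions(graph))
```

A None entry for __main__ is a sign that the graph contains components that were defined outside of any installed package.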
Warning:
Python 3.12 type aliases (type MyType = int) only allow for limited inspection of the alias. In particular, they have no __qualname__. This means that they can only be fully represented when defined at the top level of a module.
JSON schema#
The schema for the JSON object returned by TaskGraph.serialize is available as part of the Sciline package:
[4]:
from sciline.serialize import json_schema # noqa: F401
The schema is too long to show here. It is available online at scipp/sciline.