Getting Started#

Overview#

This guide motivates and introduces Sciline’s approach to developing workflows: to compute desired results, Sciline collects dependencies based on the type annotations of arguments and return values of callable objects. As a user, you can thus focus on developing each step independently, which makes workflows more flexible and testable.

Motivation#

Data analysis workflows are often complex and involve many steps. For example, we may want to:

  1. Define or import functions to use for processing.

  2. Define parameters for processing.

  3. Load the data and apply functions to it.

There are a couple of problems with this:

  • In complex workflows, we are forced to write a lot of boilerplate code to load data, apply functions, and save results. This is tedious and error-prone, e.g., since the order of function calls may be wrong or the wrong data and parameters may be passed to a function. This makes it hard to focus on the actual analysis.

  • In Jupyter notebooks, the order of cell execution is not clear. This frequently leads to errors that are hard to track down in retrospect and analysis results that are hard to reproduce.

In traditional software development some of these problems are addressed by writing unit tests or integration tests. However, in our experience this is challenging to do properly for data analysis workflows:

  • Workflows are often interactive and under active development. Very frequently, part or all of the workflow is written in a Jupyter notebook.

  • It is very time-consuming to set up test data with good fidelity, i.e., test data that will actually allow you to catch errors in your workflow.

A very simplified model of such a traditional workflow is shown below:

[1]:
def load(filename: str) -> dict:
    """Load the data from the filename."""
    return {'data': [1, 2, float('nan'), 3], 'meta': {'filename': filename}}


def clean(raw_data: dict) -> list:
    """Clean the data, removing NaNs."""
    import math

    return [x for x in raw_data['data'] if not math.isnan(x)]


def process(data: list, param: float) -> float:
    """Process the data, multiplying the sum by the scale factor."""
    return sum(data) * param


filename = 'data.txt'
scale_factor = 2.0

raw_data = load(filename)
cleaned_data = clean(raw_data)
result = process(cleaned_data, scale_factor)
result
[1]:
12.0

Sciline: Domain types, providers, and pipelines#

Sciline uses a different approach. We can rewrite the model workflow from above as follows:

[2]:
from typing import NewType
import sciline

_fake_filesystem = {'data.txt': [1, 2, float('nan'), 3]}


# 1. Define domain types

Filename = NewType('Filename', str)
RawData = NewType('RawData', dict)
CleanedData = NewType('CleanedData', list)
ScaleFactor = NewType('ScaleFactor', float)
Result = NewType('Result', float)


# 2. Define providers


def load(filename: Filename) -> RawData:
    """Load the data from the filename."""
    data = _fake_filesystem[filename]
    return RawData({'data': data, 'meta': {'filename': filename}})


def clean(raw_data: RawData) -> CleanedData:
    """Clean the data, removing NaNs."""
    import math

    return CleanedData([x for x in raw_data['data'] if not math.isnan(x)])


def process(data: CleanedData, param: ScaleFactor) -> Result:
    """Process the data, multiplying the sum by the scale factor."""
    return Result(sum(data) * param)


# 3. Create pipeline

providers = [load, clean, process]
params = {Filename: 'data.txt', ScaleFactor: 2.0}
pipeline = sciline.Pipeline(providers, params=params)

pipeline.compute(Result)
[2]:
12.0

We can also visualize the task graph for computing Result:

[3]:
pipeline.visualize(Result, graph_attr={'rankdir': 'LR'})
[3]:
[Task graph for computing Result]

Above, we have set up a data pipeline in three steps:

  1. Define the domain types, i.e., unambiguous types specific to the workflow or problem. Sciline uses “domain types” as a way to identify inputs and outputs of processing steps.

  2. Define the providers. Sciline uses “providers” to obtain domain objects, i.e., instances of required domain types. Each provider is either a callable (such as a function) that can compute the required domain object (from other domain objects) or simply a domain object.

  3. Create the pipeline from a list of callables (that require domain objects and compute derived domain objects) and parameters (existing domain objects).

The pipeline sets up a directed acyclic graph (DAG) of processing steps. This is done by inspecting the type hints of the callables, i.e., the required domain objects.

If you are familiar with the concept of dependency injection you can think of the pipeline as a dependency injection container. Each provider declares its dependencies (required domain objects) via type hints and the pipeline resolves them.
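
For illustration, the same inspection can be done by hand with the standard library (this is a sketch of the idea, not Sciline’s actual implementation):

    from typing import get_type_hints

    # The parameter annotations are the dependencies of the provider, and the
    # return annotation is the domain object it produces.
    get_type_hints(clean)
    # e.g. {'raw_data': RawData, 'return': CleanedData}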

The pipeline can then be used to compute results or to visualize the structure of the workflow.
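
For example, intermediate results can be requested by their domain type as well. A small sketch, reusing the pipeline defined above:

    # Only the providers needed for CleanedData (here: load and clean) are run.
    pipeline.compute(CleanedData)
    # -> [1, 2, 3]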

The advantages of this approach are as follows:

  1. The workflow definition is unambiguous and reader-friendly. It is easy to see what each step in the workflow does and what the dependencies between the processing steps are. For example, a processing step makes sense even out of context:

    def clean(raw_data: RawData) -> CleanedData:
       ...
    

    This function clearly converts RawData to CleanedData. If we want to understand how our data is cleaned, it is obvious that we have to look at this function. If we want to refine our cleaning procedure, it is obvious that we have to change or replace this provider (a sketch of swapping in a replacement provider follows this list).

  2. Dependencies are resolved automatically. This means that we do not have to worry about the order of function calls or passing the wrong data to a function. For example, we can be certain that we have cleaned our data, simply by depending on CleanedData in the next step. We do not have to worry about variable or parameter naming, which is a common source of errors. Instead, dependencies are resolved by type.
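
As a sketch of replacing a provider: the hypothetical clean_and_clip function below refines the cleaning step; because it returns CleanedData, everything that depends on CleanedData picks it up automatically, and no other step needs to change.

    import math


    def clean_and_clip(raw_data: RawData) -> CleanedData:
        """Remove NaNs and clip negative values to zero (hypothetical refinement)."""
        return CleanedData(
            [max(x, 0.0) for x in raw_data['data'] if not math.isnan(x)]
        )


    # Build a pipeline with the refined provider; providers and params are as before.
    refined = sciline.Pipeline([load, clean_and_clip, process], params=params)
    refined.compute(Result)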

Defining domain types#

In the above examples, we have used typing.NewType to define domain types. This is a convenient way to define domain types, but it is not required. Any other type can be used as well.

The use of typing.NewType is convenient when data is stored in a common data structure such as pandas.DataFrame or numpy.ndarray. We typically want to avoid subclassing or wrapping these types, since this can be cumbersome for users. Instead, we can use typing.NewType to define domain types that are simply aliases for these types.
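
For example, a few hypothetical domain types that are plain aliases of numpy.ndarray (the names below are made up for illustration):

    from typing import NewType

    import numpy as np

    # Aliases of numpy.ndarray; no subclassing or wrapping of the array type.
    RawCounts = NewType('RawCounts', np.ndarray)
    NormalizedCounts = NewType('NormalizedCounts', np.ndarray)


    def normalize(counts: RawCounts) -> NormalizedCounts:
        """Hypothetical provider: normalize counts to unit sum."""
        return NormalizedCounts(counts / counts.sum())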

Defining providers#

Providers are callables that compute domain objects. They can be functions, methods, or instances of classes that define a __call__ method. A provider must have type hints for all of its parameters and for its return type.
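
As a minimal sketch of the class-based variant (assuming the instance is passed in the providers list like any other callable; the Scale name is made up for illustration):

    class Scale:
        """Hypothetical provider implemented as a class with __call__."""

        def __call__(self, data: CleanedData, param: ScaleFactor) -> Result:
            # The type hints on __call__ declare what the provider needs and
            # what it produces, just like for a plain function.
            return Result(sum(data) * param)


    pipeline = sciline.Pipeline([load, clean, Scale()], params=params)
    pipeline.compute(Result)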