Timothy Wolodzko - Pipelines: The #1 data processing design pattern

In mathematics, two functions \(f\) and \(g\) can be composed \(f \circ g\), what is defined as

\[ (f \circ g)(x) = f(g(x)) \]

In the same way, functions in programming can be composed into pipelines.

Unix pipes

As a design pattern in programming, they were popularized by unix pipes, where a series of commands can be composed using the pipe | operator. For example, the command below would count the unique cells from the second column of a CSV file by combining the cut, sort, and uniq commands.

cut -d, -f2 data.csv | sort | uniq -c

The pattern was a consequence of unix philosophy, which assumed the workflow composed of chained programs

Make each program do one thing well. […]

Expect the output of every program to become the input to another, as yet unknown, program. […]

Pipelines in functional programming

Pipelines are also popular in functional programming languages. For example, Haskell uses syntax inspired by mathematical notation (f . g). OCaml has the |> pipe operator defined as an inflix operator

let (|>) v f = f v

When using it, v |> f gets translated to the f v function call, so 2 |> (+) 2 |> (/) 8 becomes (/) 8 ((+) 2 2). Clojure uses the threading macros -> and ->> that pass the input as the first or second argument subsequently. In Clojure, the example that I just used would take the following form

(->>
    2
    (+ 2)
    (/ 8))

Data processing pipelines in R

The pipes were also a very popular pattern in statistical programming language R, where it was first available through an external library that exposed the %>% operator, but due to its heavy usage in the R community, in R 4.0.0 it was included in the core language as |>. For example, to calculate per-group averages an R user could use the following code

library(dplyr)

mtcars |>
    group_by(cyl) |>
    summarise(mpg = mean(mpg))

## # A tibble: 3 x 2
##     cyl   mpg
##   <dbl> <dbl>
## 1     4  26.7
## 2     6  19.7
## 3     8  15.1

The pipelines like above, consisting of pure functions, fulfill all the mathematical properties of function composition. We can define a new function \[p(x) = f(g(x))\] and use it in a composition \[h \circ p = h \circ f \circ g\]. For the same reason, pipelines can use other pipelines as steps. This is how a program can be decomposed into a series of smaller steps in a functional architecture.

Mutable pipelines

But there is another kind of a pipeline, the mutable (or trainable) one. They are commonly used in Python’s scikit-learn and take the form below

complete_pipeline = Pipeline([
    ("preprocessor", preprocessing_pipeline),
    ("estimator", LinearRegression())
])

This pipeline is an object with the same interface as its steps (exposing the fit, transform, or predict methods). When running complete_pipeline.fit(X, y), the pipeline would call fit in preprocessor and pass the result as an input to the fit method of the estimator. Notice that the fit method mutates the pipeline object. If during preprocessing we used a scaling transformer, it would learn how to scale the data given the training set and be able to apply the transformation to new data. Calling fit on the machine learning model would lead to training it, so the model can be used for making predictions.

Non-mutable, trainable pipelines

We need a fit method that sets up the pipeline and a transform or predict method to apply it. In scikit-learn the pipeline and the objects it consists of are mutable, however, it would also be possible to create a pipeline in a functional programming paradigm. The only thing we need is the support for first-class functions. In such a case, the fit function would return the predicted pipeline build from individual step functions. Such a purely functional pipeline could look like in the example below (or the same example in Scheme).

def fit(steps, input):
    new_steps = []
    for step in steps:
        fitted = step(input)
        input = fitted(input)
        new_steps.append(fitted)
    return new_steps

def transform(steps, input):
    output = input
    for step in steps:
        output = step(output)
    return output

transform(fit([
    lambda x: lambda y: y + x,  # => y + 2
    lambda x: lambda y: y / x,  # => y / 4
], 2), 7)                       # => 9 / 4 = 2.25

As you can see, fit and transform serve completely different purposes. fit is used as a pipeline factory, while transform runs a regular, non-mutable pipeline.

OK, but what’s the fuss?

The main reason for using pipelines is that they lead to more concise and readable code. An additional benefit is that the steps can be easily changed, replaced, or removed, which makes iterating over the code easier. Individual steps can be implemented and tested separately. The steps, like LEGO blocks, can be used to compose many different pipelines. Pipelines also ensure consistency, because they guarantee that the steps would be always invoked in the same order. It is a simple, yet powerful design pattern.