# Pipelines: The #1 data processing design pattern

In mathematics, two functions \(f\) and \(g\) can be composed into \(f \circ g\), which is defined as

\[(f \circ g)(x) = f(g(x))\]

In the same way, functions in programming can be composed into pipelines.
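The same idea can be sketched in Python with a small `compose` helper (the helper and the example functions are my own, just for illustration):

```python
def compose(f, g):
    """Return the composition f . g, i.e. a function x -> f(g(x))."""
    return lambda x: f(g(x))

double = lambda x: 2 * x
increment = lambda x: x + 1

# (double . increment)(3) = double(increment(3)) = double(4)
composed = compose(double, increment)
print(composed(3))  # => 8
```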

## Unix pipes

As a design pattern in programming, pipelines were popularized by Unix pipes, where a series of commands can be
composed using the pipe operator `|`. For example, the command below would count the occurrences of each unique value
in the second column of a CSV file by combining the `cut`, `sort`, and `uniq` commands.

```
cut -d, -f2 data.csv | sort | uniq -c
```
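The same computation can be mimicked in plain Python; the list of lines below is just a hypothetical stand-in for the contents of `data.csv`:

```python
from collections import Counter

# Hypothetical stand-in for the lines of data.csv
lines = ["a,x,1", "b,y,2", "c,x,3", "d,x,4"]

# cut -d, -f2: take the second comma-separated field of each line
second_column = (line.split(",")[1] for line in lines)

# sort | uniq -c: count the occurrences of each unique value
counts = Counter(second_column)
print(counts)  # => Counter({'x': 3, 'y': 1})
```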

The pattern was a consequence of the Unix philosophy, which assumed a workflow composed of chained programs:

- Make each program do one thing well. […]
- Expect the output of every program to become the input to another, as yet unknown, program. […]

## Pipelines in functional programming

Pipelines are also popular in functional programming languages. For example, Haskell uses syntax inspired by mathematical
notation: `(f . g)`. OCaml has the `|>` pipe operator, defined as an infix operator

```
let (|>) v f = f v
```

When using it, `v |> f` gets translated to the function call `f v`, so `2 |> (+) 2 |> (/) 8` becomes
`(/) 8 ((+) 2 2)`. Clojure uses the threading macros `->` and `->>`, which pass the result of each step
as the first or last argument of the next one, respectively. In Clojure, the example that I just used would take
the following form

```
(->> 2
     (+ 2)
     (/ 8))
```
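A `|>`-style helper is also easy to sketch in Python, for example with `functools.reduce` (the `pipe` name is my own, not a standard library function):

```python
from functools import reduce

def pipe(value, *functions):
    """Thread value through functions left to right, like OCaml's |>."""
    return reduce(lambda acc, f: f(acc), functions, value)

# 2 |> (+) 2 |> (/) 8 from the OCaml example above
result = pipe(2, lambda v: 2 + v, lambda v: 8 / v)
print(result)  # => 2.0
```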

## Data processing pipelines in R

Pipes are also a very popular pattern in the statistical programming language R, where they were first available through
an external library (magrittr) that exposed the `%>%` operator; due to heavy usage in the R community, a native pipe
`|>` was added to the core language in R 4.1.0. For example, to calculate per-group averages, an R user could use the following
code

```
library(dplyr)
mtcars |>
  group_by(cyl) |>
  summarise(mpg = mean(mpg))
## # A tibble: 3 x 2
##     cyl   mpg
##   <dbl> <dbl>
## 1     4  26.7
## 2     6  19.7
## 3     8  15.1
```
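For comparison, the same per-group average can be computed in plain Python; the records below are only a hypothetical stand-in for a few rows of `mtcars`:

```python
from collections import defaultdict

# Hypothetical rows in the shape of mtcars: (cyl, mpg)
rows = [(4, 26.0), (4, 27.4), (6, 19.7), (8, 15.0), (8, 15.2)]

# Group mpg values by cylinder count
groups = defaultdict(list)
for cyl, mpg in rows:
    groups[cyl].append(mpg)

# Per-group mean
averages = {cyl: sum(mpgs) / len(mpgs) for cyl, mpgs in groups.items()}
print(averages)
```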

Pipelines like the one above, consisting of pure functions, fulfill all the mathematical properties of function composition. We can define a new function \(p(x) = f(g(x))\) and use it in a composition \(h \circ p = h \circ f \circ g\). For the same reason, pipelines can use other pipelines as steps. This is how a program can be decomposed into a series of smaller steps in a functional architecture.

## Mutable pipelines

But there is another kind of pipeline: the mutable (or trainable) one. Such pipelines are commonly used in Python’s scikit-learn and take the form below

```
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

preprocessing_pipeline = Pipeline([("scaler", StandardScaler())])

complete_pipeline = Pipeline([
    ("preprocessor", preprocessing_pipeline),
    ("estimator", LinearRegression())
])
```

This pipeline is an object with the same interface as its steps (exposing the `fit`, `transform`, or `predict` methods).
When running `complete_pipeline.fit(X, y)`, the pipeline calls `fit_transform` on the `preprocessor` and passes the
transformed data as the input to the `fit` method of the `estimator`. Notice that the `fit` method mutates the pipeline
object. If during preprocessing we used a scaling transformer, it would learn how to scale the data from the training set
and would then be able to apply the same transformation to new data. Calling `fit` on the machine learning model trains it, so
the model can be used for making predictions.
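To make the mutation explicit, here is a toy transformer in the scikit-learn style (a hypothetical `MeanScaler`, not part of any library); calling `fit` changes the object's internal state:

```python
class MeanScaler:
    """A toy transformer that learns the mean of its training data."""

    def __init__(self):
        self.mean = None  # state that fit will mutate

    def fit(self, data):
        self.mean = sum(data) / len(data)
        return self

    def transform(self, data):
        # uses the state learned during fit
        return [x - self.mean for x in data]

scaler = MeanScaler()
scaler.fit([1.0, 2.0, 3.0])     # mutates scaler: mean becomes 2.0
print(scaler.transform([4.0]))  # => [2.0]
```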

## Non-mutable, trainable pipelines

We need a `fit` method that sets up the pipeline and a `transform` or `predict` method to apply it.
In scikit-learn, the pipeline and the objects it consists of are mutable; however, it would also be possible to create
a pipeline in a functional programming paradigm. The only thing we need is support for first-class functions.
In such a case, the `fit` function would return the fitted pipeline built from the individual step functions.
Such a purely functional pipeline could look like the example below (or the same example in Scheme).

```
def fit(steps, input):
    new_steps = []
    for step in steps:
        fitted = step(input)
        input = fitted(input)
        new_steps.append(fitted)
    return new_steps

def transform(steps, input):
    output = input
    for step in steps:
        output = step(output)
    return output

transform(fit([
    lambda x: lambda y: y + x,  # => y + 2
    lambda x: lambda y: y / x,  # => y / 4
], 2), 7)  # => 9 / 4 = 2.25
```

As you can see, `fit` and `transform` serve completely different purposes: `fit` is used as a pipeline factory,
while `transform` runs a regular, non-mutable pipeline.

*OK, but what’s the fuss?*

The main reason for using pipelines is that they lead to more concise and readable code. An additional benefit is that the steps can easily be changed, replaced, or removed, which makes iterating on the code easier. Individual steps can be implemented and tested separately and, like LEGO blocks, reused to compose many different pipelines. Pipelines also ensure consistency, because they guarantee that the steps are always invoked in the same order. It is a simple, yet powerful design pattern.