In mathematics, two functions \(f\) and \(g\) can be composed into \(f \circ g\), which is defined as
\[ (f \circ g)(x) = f(g(x)) \]
In the same way, functions in programming can be composed into pipelines.
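As a minimal sketch of the definition above in Python, a compose helper can be built from a fold. The names compose, add_one, and double are made up for illustration and do not come from any library.

```python
from functools import reduce


def compose(*functions):
    """Compose right to left, so that compose(f, g)(x) == f(g(x))."""
    def composed(x):
        # Apply the last function first, as in mathematical notation.
        return reduce(lambda acc, fn: fn(acc), reversed(functions), x)
    return composed


def add_one(x):
    return x + 1


def double(x):
    return x * 2


add_one_after_double = compose(add_one, double)
print(add_one_after_double(5))  # 11, because add_one(double(5)) = 10 + 1
```

Note that the order matters: compose(double, add_one)(5) gives 12 instead.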
Unix pipes
As a design pattern in programming, pipelines were popularized by Unix pipes, where a series of commands can be composed using the pipe | operator. For example, the command below counts the occurrences of each unique value in the second column of a CSV file by combining the cut, sort, and uniq commands.
cut -d, -f2 data.csv | sort | uniq -c
The pattern follows from the Unix philosophy, which assumes that a workflow is composed of chained programs:
- Make each program do one thing well. […]
- Expect the output of every program to become the input to another, as yet unknown, program. […]
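The same idea can be sketched in Python with small functions that each do one thing and consume the output of the previous one. This is only an illustration of the pattern, not a reimplementation of the real tools: cut, sort_values, and uniq_count are hypothetical helpers, and the sample rows are invented.

```python
from itertools import groupby


def cut(lines, field, delimiter=","):
    """Roughly like `cut -d, -fN`: yield the N-th (1-based) field of each line."""
    for line in lines:
        yield line.rstrip("\n").split(delimiter)[field - 1]


def sort_values(values):
    """Roughly like `sort`: return the values in sorted order."""
    return sorted(values)


def uniq_count(values):
    """Roughly like `uniq -c`: count runs of adjacent equal values."""
    for value, run in groupby(values):
        yield (sum(1 for _ in run), value)


# Made-up sample rows standing in for data.csv.
sample = ["1,red,x", "2,blue,y", "3,red,z"]

result = list(uniq_count(sort_values(cut(sample, 2))))
print(result)  # [(1, 'blue'), (2, 'red')]
```

As with the shell version, uniq_count only merges adjacent duplicates, which is why the sorting step has to come before it.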
Pipelines in functional programming
Pipelines are also popular in functional programming languages. For example, Haskell uses syntax inspired by mathematical notation, (f . g). OCaml has the |> pipe operator, defined as an infix operator
let (|>) v f = f v
When using it, v |> f gets translated to the f v function call, so 2 |> (+) 2 |> (/) 8 becomes (/) 8 ((+) 2 2). Clojure uses the threading macros -> and ->>, which pass the input as the first or the last argument, respectively. In Clojure, the example above would take the following form
(->> 2
     (+ 2)
     (/ 8))
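Python has no built-in pipe operator, but the same left-to-right threading can be approximated with a small helper. The sketch below assumes a hypothetical pipe function; functools.partial plays the role of the partial applications (+) 2 and (/) 8 from the OCaml example.

```python
from functools import partial, reduce
from operator import add, truediv


def pipe(value, *functions):
    """Thread value through the functions from left to right."""
    return reduce(lambda acc, fn: fn(acc), functions, value)


# partial(add, 2) behaves like (+) 2, and partial(truediv, 8) like (/) 8.
result = pipe(2, partial(add, 2), partial(truediv, 8))
print(result)  # 2.0, because 8 / (2 + 2) = 2.0
```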
Data processing pipelines in R
Pipes are also a very popular pattern in the statistical programming language R. They were first available through an external library that exposed the %>% operator, but due to heavy usage in the R community, a native |> operator was added to the core language in R 4.1.0. For example, to calculate per-group averages, an R user could use the following code
library(dplyr)

mtcars |>
  group_by(cyl) |>
  summarise(mpg = mean(mpg))
## # A tibble: 3 x 2
##     cyl   mpg
##   <dbl> <dbl>
## 1     4  26.7
## 2     6  19.7
## 3     8  15.1
Pipelines like the one above, consisting of pure functions, have all the mathematical properties of function composition. We can define a new function \(p(x) = f(g(x))\) and use it in a composition \(h \circ p = h \circ f \circ g\). For the same reason, pipelines can use other pipelines as steps. This is how a program can be decomposed into a series of smaller steps in a functional architecture.
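The identity \(h \circ p = h \circ f \circ g\) can be checked with a short Python sketch. The pipeline helper and the functions f, g, and h are arbitrary examples chosen for illustration. The helper applies its steps left to right, so pipeline(g, f) corresponds to \(f \circ g\).

```python
def pipeline(*steps):
    """Build a function that applies the steps from left to right."""
    def run(x):
        for step in steps:
            x = step(x)
        return x
    return run


def g(x):
    return x + 1


def f(x):
    return x * 2


def h(x):
    return x - 3


p = pipeline(g, f)        # p(x) = f(g(x))
nested = pipeline(p, h)   # h composed with p: a pipeline used as a step
flat = pipeline(g, f, h)  # h composed with f and g directly

# Both pipelines agree on every tested input.
assert all(nested(x) == flat(x) for x in range(-5, 6))
print(nested(4))  # 7, because ((4 + 1) * 2) - 3 = 7
```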
Mutable pipelines
But there is another kind of pipeline: the mutable (or trainable) one. Such pipelines are commonly used in Python’s scikit-learn and take the form below
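from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# preprocessing_pipeline is assumed to be defined earlier (e.g., another Pipeline)
complete_pipeline = Pipeline([
    ("preprocessor", preprocessing_pipeline),
    ("estimator", LinearRegression())
])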
This pipeline is an object with the same interface as its steps (exposing the fit, transform, or predict methods). When running complete_pipeline.fit(X, y), the pipeline calls fit on the preprocessor, uses it to transform the data, and passes the transformed data as input to the fit method of the estimator. Notice that the fit method mutates the pipeline object. If we used a scaling transformer during preprocessing, it would learn how to scale the data from the training set and then be able to apply the same transformation to new data. Calling fit on the machine learning model trains it, so the model can be used for making predictions.
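To make the mechanics concrete, here is a toy, pure-Python sketch of such a mutable pipeline. It is not scikit-learn's implementation: Centerer, SlopeRegressor, and ToyPipeline are hypothetical classes that only mimic the fit, transform, and predict interface described above.

```python
class Centerer:
    """Toy transformer: learns the mean in fit, subtracts it in transform."""

    def __init__(self):
        self.mean_ = None  # unset until fit is called

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)  # fit mutates the object
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class SlopeRegressor:
    """Toy estimator: least-squares line through the origin."""

    def __init__(self):
        self.slope_ = None

    def fit(self, X, y):
        self.slope_ = sum(x * t for x, t in zip(X, y)) / sum(x * x for x in X)
        return self

    def predict(self, X):
        return [self.slope_ * x for x in X]


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        *transformers, (_, estimator) = self.steps
        for _, step in transformers:
            X = step.fit(X, y).transform(X)  # fit, then pass transformed data on
        estimator.fit(X, y)
        return self

    def predict(self, X):
        *transformers, (_, estimator) = self.steps
        for _, step in transformers:
            X = step.transform(X)  # reuse what was learned during fit
        return estimator.predict(X)


centerer = Centerer()
toy_pipeline = ToyPipeline([("preprocessor", centerer),
                            ("estimator", SlopeRegressor())])
toy_pipeline.fit([1.0, 2.0, 3.0], [-2.0, 0.0, 2.0])
print(centerer.mean_)                # 2.0, learned during fit
print(toy_pipeline.predict([4.0]))   # [4.0]
```

The centerer object held no mean before fit was called and holds one afterwards, which is the mutation the text refers to.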
Non-mutable, trainable pipelines
We need a fit method that sets up the pipeline and a transform or predict method to apply it. In scikit-learn, the pipeline and the objects it consists of are mutable. However, it would also be possible to create such a pipeline in a functional programming paradigm. The only thing we need is support for first-class functions. In such a case, the fit function would return the fitted pipeline built from individual step functions. Such a purely functional pipeline could look like the example below (or the same example in Scheme).
def fit(steps, input):
    new_steps = []
    for step in steps:
        fitted = step(input)
        input = fitted(input)
        new_steps.append(fitted)
    return new_steps

def transform(steps, input):
    output = input
    for step in steps:
        output = step(output)
    return output

transform(fit([
    lambda x: lambda y: y + x,  # => y + 2
    lambda x: lambda y: y / x,  # => y / 4
], 2), 7)  # => 9 / 4 = 2.25
As you can see, fit and transform serve completely different purposes. fit is used as a pipeline factory, while transform runs a regular, non-mutable pipeline.
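The same scheme can carry learned parameters that look more like preprocessing. In the sketch below, center and rescale are hypothetical step factories: each one "trains" on the data it receives and returns a closure holding the learned value. The fit and transform functions are restated so that the block runs on its own, with fit returning a tuple to stress that nothing is mutated afterwards.

```python
from functools import reduce


def fit(steps, data):
    """Fit each step on the data seen so far and return the fitted steps."""
    fitted_steps = []
    for step in steps:
        fitted = step(data)   # "training" returns a plain function
        data = fitted(data)   # the next step sees the transformed data
        fitted_steps.append(fitted)
    return tuple(fitted_steps)


def transform(fitted_steps, data):
    """Apply the fitted steps from left to right."""
    return reduce(lambda acc, step: step(acc), fitted_steps, data)


def center(train):
    mean = sum(train) / len(train)  # learned from the training data
    return lambda xs: [x - mean for x in xs]


def rescale(train):
    largest = max(abs(x) for x in train)  # learned from the training data
    return lambda xs: [x / largest for x in xs]


fitted = fit([center, rescale], [0.0, 4.0, 8.0])
print(transform(fitted, [6.0]))       # [0.5]
print(transform(fitted, [0.0, 8.0]))  # [-1.0, 1.0]
```

Because the fitted pipeline is just a tuple of functions, it can be applied to any number of new inputs without changing its behavior.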
OK, but what’s the fuss?
The main reason for using pipelines is that they lead to more concise and readable code. An additional benefit is that the steps can be easily changed, replaced, or removed, which makes iterating on the code easier. Individual steps can be implemented and tested separately. The steps, like LEGO blocks, can be used to compose many different pipelines. Pipelines also ensure consistency, because they guarantee that the steps are always invoked in the same order. It is a simple, yet powerful design pattern.