Data Scientists Write Bad Code or Maybe That’s Not the Problem?

Published: November 24, 2021

Technologies and methodologies that work for software engineering don’t necessarily work for data science. At the same time, we hear software engineers complaining about the quality of data science code and the tooling used. But maybe the two disciplines are less similar than they appear? I recently started wondering how they differ. Vicki Boykis’ post on the two models of a programmer’s brain finally inspired me to write my thoughts down. Let me put on my data science hat and play the Devil’s advocate. I’d argue that there are reasons why data science is different from programming.

The primary difference between software engineering and data science is the reason for writing the code. In data science, the code is only a means of solving another problem. Our work focuses on exploring the data, drawing conclusions, generating new research questions, and running experiments. Training a machine learning model is also an experiment, in which we try to learn whether the model would help make better predictions. Jupyter notebooks can be seen more as a laboratory journal than as production code. Sometimes the job is to write code, but I would argue that this is a different type of work.

Prototyping vs production

As Vicki notices, following Felienne Hermans, a programmer’s work happens in two distinct modes: prototyping and production. Data science work seems to be more about the prototyping mode. For example, this applies to programming languages that

can either be easy for prototyping or easy for production. You usually can’t be both. Now, some will argue that there are languages that fit both purposes, and it’s true: there are cases of prototyping languages in production and production languages used for prototypes, but by trying to fit one into the other, we lose some of the properties of both.

The distinction is not new. Extreme Programming calls the prototyping code spike (not Spark) solutions and has different rules for them compared to production code.

A spike solution is a very simple program to explore potential solutions. Build the spike to only address the problem under examination and ignore all other concerns. Most spikes are not good enough to keep, so expect to throw them away. The goal is to reduce the risk of a technical problem or increase the reliability of a user story’s estimate.

Spike solutions are about moving fast to verify a hypothesis. Writing production-quality code for a spike would be a waste. That is also how we treat Jupyter notebooks: we want to move fast to conduct an experiment, and we expect most of the experiments to fail. But while you should throw away a spike solution after you’re done, that is often not the case with notebooks. Maybe our experiments aren’t like spikes?

How does it work?

Jupyter notebooks may look like the Wild West to seasoned software engineers, but they are not, or at least they don’t have to be. Test-driven development is one of the programming best practices. TDD advises writing a unit test before implementing the functionality: you observe how the test fails, write the minimal implementation, see the test pass, and finally improve the code when needed. Working in Jupyter notebooks is an example of REPL-driven development [1]. We write the code in a notebook cell and run it to observe the result below. This gives instant feedback, just as running a unit test in TDD does. As Saleem Siddiqui notices, the biggest difference is that after working interactively, we are left without tests that could be used for continuous integration [2].
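
To make the parallel concrete, here is a minimal sketch of the two styles side by side: a REPL-style print check as we would run it in a notebook cell, and the same check rewritten as an assertion so it could later run in CI. The build_features function and the toy DataFrame are invented purely for illustration.

```python
import pandas as pd

def build_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical feature-engineering step, used only for illustration."""
    features = raw.dropna().copy()
    features["ratio"] = features["a"] / features["b"]
    return features

raw = pd.DataFrame({"a": [1.0, 2.0, None], "b": [2.0, 4.0, 6.0]})

# REPL-style check: run the cell and eyeball the printed output.
features = build_features(raw)
print(features.shape)  # (2, 3)

# The same check written as an assertion still gives instant feedback in the
# notebook, but it could also run unattended as part of continuous integration.
assert features.shape == (2, 3)
```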

Why do we prefer working interactively?

In data science work, there are many unknowns. The problems are much less structured than in programming. Our solution depends not only on the code but also on the data and the surprises it brings.

In software engineering, we usually can (and should) split the work into small chunks and develop and test each chunk independently (unit tests). In data science, the results depend not only on the code: to test a solution, we need to run it on the actual data and look at the results. This is where notebooks come in.

To verify the result, we usually need to run the code end-to-end: download the data, preprocess it, train the model, produce validation metrics and plots, and so on. This can be time-consuming. Notebooks allow us to pause the process, make changes, and move forward rather than re-run everything. The output of each executed cell serves as early feedback or debugging information. It is efficient.
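
As a rough illustration of this workflow, here is how the steps might be laid out as notebook cells. The sketch uses scikit-learn and its bundled diabetes dataset as a stand-in for real data, with cell boundaries marked by # %% comments:

```python
# %% Cell 1 - download/load the data (the slow step; runs once per session)
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)

# %% Cell 2 - preprocess; the result stays in memory between cells
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)

# %% Cell 3 - train the model
from sklearn.linear_model import Ridge
model = Ridge(alpha=1.0).fit(X_scaled, y)

# %% Cell 4 - validation metrics; we can edit and re-run this cell
# repeatedly without re-running cells 1-3
print(model.score(X_scaled, y))
```

Each intermediate result stays in memory, so tweaking the evaluation in cell 4, or the model in cell 3, does not force us to download and preprocess the data again.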

In programming, we are concerned about reliability, scalability, maintainability, security, and so on. None of those concerns bothers us when we need to run research code once, for a particular dataset. On the other hand, building shared libraries, automated reports, data pipelines, and the like are programming tasks that need software engineering rigor (the production mode).

Moreover, some things make much of the data science code easier to write than general-purpose code. First of all, the scope is narrower. We frequently can (and should) apply a few familiar design patterns to most of the problems (e.g. pipelines, transformers and classifiers/regressors, the functional and sequential APIs in deep learning frameworks). Also, the code to be written is often relatively simple: feature scaling is just basic arithmetic, filtering is a simple rule inside a loop, a deep learning model can be glued together in Keras from predefined building blocks, and so on. Even TDD gurus like Kent Beck agree that there is no need to test trivial code. If those things break, it is usually instantly obvious. In many cases, we do live in a rather safe programming environment.
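
For instance, a typical classification task can be assembled almost entirely from such building blocks. The sketch below uses scikit-learn’s Pipeline with the Iris dataset as a placeholder; the point is only that the familiar transformer-plus-estimator pattern covers the whole model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The familiar pattern: a transformer (scaling) followed by an estimator.
model = Pipeline([
    ("scale", StandardScaler()),                 # just basic arithmetic
    ("clf", LogisticRegression(max_iter=1000)),  # a predefined building block
])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```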

So it’s fine?

So maybe there is nothing wrong with data science code being quick and dirty? Not exactly. Imagine a medical research lab where the technicians had to build the equipment themselves each time before doing the actual research. Their results would be hard for other labs to reproduce and more prone to errors. Yet this is a common anti-pattern in data science! To be able to focus on exploratory work, we need high-quality “laboratory equipment.”

When do you need to care about quality? In programming, if you find yourself repeating the same code, you should abstract it into a separate function (the DRY rule). If code repeats between notebooks, move it to a shared library. As our lab equipment, it should follow all the design standards. In preprocessing, repeated operations can be translated into pipelines that save standardized results to feature stores. Building a high-quality implementation of a machine learning model spares painful debugging in the future. Finally, any code that gets deployed should be treated like any other production code. The same applies to any high-stakes reports. All such cases are closer to regular software engineering (the production mode!). Not surprisingly, these tasks are commonly delegated to data engineers or machine learning engineers to “clean up” and “productionize” the code. Maybe, after all, research does need different tools, mindsets, and skills.
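
To give a flavor of what moving code to a shared library can look like in practice, here is a small, hypothetical example: a helper that used to be copy-pasted between notebooks, extracted into a module and covered by a unit test. The names features.py, add_ratio_feature, and the columns are invented for illustration.

```python
# features.py - a hypothetical shared module extracted from several notebooks
import pandas as pd

def add_ratio_feature(df: pd.DataFrame, num: str, den: str) -> pd.DataFrame:
    """Add a ratio column that previously was copy-pasted across notebooks."""
    out = df.copy()
    out[f"{num}_per_{den}"] = out[num] / out[den]
    return out

# test_features.py - shared "lab equipment" deserves real unit tests (run with pytest)
def test_add_ratio_feature():
    df = pd.DataFrame({"clicks": [10, 20], "views": [100, 200]})
    result = add_ratio_feature(df, "clicks", "views")
    assert result["clicks_per_views"].tolist() == [0.1, 0.1]
```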

Conclusions

I remember getting so preoccupied with writing good code for a piece of research that, while I ended the day with a nice pull request, I had not even been able to run it yet. I had missed the main point of writing it.

We do, in effect, work differently when doing research than when writing production code. Maybe instead of fighting “bad code” and “bad programming habits”, we should acknowledge this fact and focus on the deeper reasons behind it. As with spikes in Extreme Programming, it might help us move faster in unmapped territory. Using a notebook is a compromise in which we value flexibility over rigor. Depending on the circumstances, it may or may not be worth it.

“Data scientists write bad code because they lack skills” ignores many subtleties and does nothing to answer the “why” question. Trying to turn data scientists into software engineers misses the point of why we work the way we do. Now, putting on my software engineer’s hat, I wonder what could be done to make the process more efficient.

Footnotes

  1. REPL stands for the read-eval-print loop.

  2. To make a notebook self-testable, one can replace lines like print(features.shape) with assert statements such as assert features.shape == expected_shape.