In his post, “Jupyter on Kubernetes: A DevOps Perspective”, Scott Stuart discusses the growing profile of Jupyter Notebook as an analytical tool in science and some of the technical considerations around how we deploy and manage notebooks at SemanticBits. Here, we will delve deeper into the question of why Jupyter Notebook is such a useful tool, and provide a brief hands-on demonstration of its capabilities.
Notebook interfaces are not new. Long a staple of scientific and mathematical computing software, they have only recently become commonly encountered outside of academia and niche industries.
Jupyter Notebooks boast three advantages that make them an appealing solution:
- They can vastly increase development speed, particularly when working with data or novel problems.
- They are language and system architecture agnostic.
- They are easy to share and consume.
Increasing Development Speed
The notebook interface—which allows the user to run chunks of code stored in separate cells and share objects in memory across cells—is particularly effective when exploring novel problems or doing rapid prototyping. A user can experiment with new libraries, easily modifying code until it runs, and then create a new cell to continue onward. Being able to repeatedly re-run one or more cells, rather than re-running an entire script or executing a test suite, narrows the feedback loop for the developer and effectively allows the user to save their progress within a single “run” of their program.
This feature of notebook interfaces is particularly suited for working with data, where oftentimes the user would rather avoid repeatedly reading a large file or set of files into memory. Instead, the data can be loaded once, and all subsequent development or analysis can simply access the in-memory representation for the duration of the session.
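A minimal sketch of this load-once pattern, with each notebook cell marked as a comment (the inline CSV is a hypothetical stand-in for a large file on disk):

```python
import csv
import io

# Cell 1: the expensive load, run once per session. The inline CSV here is a
# hypothetical stand-in for a large file read from disk.
raw = io.StringIO("provider,amount\nA,100\nB,250\nA,50\n")
rows = list(csv.DictReader(raw))

# Cells 2 onward: iterate freely on the in-memory `rows` without re-reading.
totals = {}
for row in rows:
    totals[row["provider"]] = totals.get(row["provider"], 0) + int(row["amount"])
print(totals)  # {'A': 150, 'B': 250}
```

In a script, every tweak to the aggregation logic would repeat the load; in a notebook, only the cheap cells are re-run.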
Jupyter Notebooks provide several additional conveniences that can improve productivity. For instance, when using Python, the docstring of an object can be displayed in a separate pane simply by appending a question mark to the object in question. Python users also have access to a host of useful functionality inherited from Jupyter's ancestor project, IPython. For instance, the %timeit magic command allows you to quickly profile a snippet of code. Pythonistas can even enjoy interactive debugging within a cell using the usual pdb.set_trace() breakpoint or the %debug magic command. Documentation on magic commands is available in the IPython documentation.
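Both conveniences build on plain Python machinery, which the following sketch illustrates (the `?` help and `%timeit` themselves only work inside an IPython/Jupyter session):

```python
import timeit

# In a notebook cell, `sum?` pops up the documentation; it is read from the
# object's __doc__ attribute:
doc = sum.__doc__

# `%timeit sum(range(1000))` wraps the standard-library timeit module,
# roughly like:
total = timeit.timeit("sum(range(1000))", number=1000)
print(f"{total / 1000 * 1e6:.1f} µs per loop")
```

The magic version adds conveniences on top, such as automatically choosing the number of loops and reporting the best of several runs.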
All users have easy access to command-line utilities within the notebook environment; these can be invoked within cells by prefixing the command with an exclamation point (e.g., !ls). Even better, Python variables defined in earlier cells can be interpolated into these shell commands using the $ prefix. This functionality allows users to seamlessly intermingle Python logic with powerful Unix utilities, substantially simplifying file system operations and other tasks where those utilities excel.
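Under the hood, the kernel substitutes the variable and spawns a subprocess. A plain-Python sketch of what these notebook cells do (assuming a Unix system with ls available):

```python
import subprocess

# In a notebook, these two cells:
#
#     target = "."
#     files = !ls $target
#
# have the kernel substitute the Python variable `target` into the shell
# command and capture its output. A rough plain-Python equivalent:
target = "."
result = subprocess.run(["ls", target], capture_output=True, text=True)
files = result.stdout.splitlines()
```

The notebook version returns the captured lines as a list-like object, so shell output can feed straight back into Python logic.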
Language and Architecture Agnostic
Scott Stuart has already written at length about the scalable architecture that underpins our use of JupyterHub. This is possible because the user-facing interface is built on top of web technologies, abstracting away any direct dependency on the system implementation that communicates with the frontend. As far as the user is concerned, working on a notebook locally looks and functions much the same way a notebook would run on an ephemeral instance within a virtual private cloud (VPC). As an added bonus, since it is essentially a web application, Jupyter benefits from improvements made throughout this ecosystem.
Another advantage of the kernel-driven approach to notebook execution is that it paves the way for cell-level kernels. With recent advances in the Apache Arrow project, a columnar in-memory data format, we may one day work within a notebook that uses Python to munge the data in one cell and R to plot the cleaned data in the next. Perhaps this is less far-fetched than it may at first appear, with titans of both the Python and the R data worlds collaborating.
Easy to Share and Consume
In 2005, John Ioannidis published a paper titled “Why Most Published Research Findings Are False”, in which he demonstrates, with basic logic and probability, how easily the body of research within a field may come to be dominated by false findings. Over the ensuing decade, a growing number of studies across the social and medical sciences attempted to replicate the results of some of their fields' most influential experiments in order to assess their validity. While replication rates vary across areas of study, the general finding has been that they fall well below what the original studies' statistical significance would predict, lending support to Ioannidis's argument and giving rise to the replication crisis.
One suggested route to improving the quality of research is greater transparency: openly sharing research methods, data, and code. In this regard, Jupyter has several substantial advantages over other common statistical analysis tools. The fact that Jupyter is free and open-source software supports this transparency both directly and indirectly. It contributes directly to open science because anyone can obtain the software used to produce the research results. As part of the open-source world, and as a tool built on other open-source technologies, Jupyter indirectly contributes to transparency by encouraging open development of further tools targeting its platform.
On a more technical level, Jupyter also makes it easy to share analyses and research by leveraging web technologies to render its user interface. Notebooks are stored as JSON documents, with images base64-encoded for portability. Markdown may be used in non-code cells, providing a convenient and flexible way to add commentary and narrative to an analysis. Because the JSON is automatically converted to HTML for rendering, Jupyter Notebooks can be published to the web with little fuss; GitHub even renders notebooks directly in-browser from the underlying version-controlled JSON. If other formats are required, notebooks are easily converted into PDFs or raw Python scripts with the bundled nbconvert tool. Perhaps most importantly, the output of code cells is stored directly in the notebook's JSON. As a result, the consumer of a notebook published to the web need not install anything to inspect the code and its output, greatly simplifying the process of sharing and reviewing results.
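To make the storage format concrete, here is a hand-written, minimal notebook in the nbformat-4 JSON layout (real files carry more metadata, but the shape is the same):

```python
import json

# A minimal notebook in the nbformat-4 JSON layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Analysis\n", "Narrative lives in Markdown cells.\n"],
        },
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print(2 + 2)\n"],
            # Output is saved alongside the code, so renderers such as GitHub
            # can display results without executing anything.
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["4\n"]}
            ],
        },
    ],
}

serialized = json.dumps(notebook)
code_cells = [c for c in notebook["cells"] if c["cell_type"] == "code"]
```

Converting such a file to another format is then a one-liner on the command line, e.g. jupyter nbconvert --to html analysis.ipynb.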
All of these advantages were compelling enough to encourage Paul Romer, former World Bank chief economist, to adopt Jupyter and Python as his statistical analysis and research dissemination tool out of frustration with the existing proprietary statistical software tools so common in academia.
Working with Jupyter Notebooks
To give you a sense of what working in a Jupyter Notebook environment feels like, the notebook below walks through a quick exploratory data analysis of a publicly available CMS payments dataset.
How we use JupyterHub and Jupyter Notebooks at SemanticBits
On the Quality Payment Program (QPP) Analytics and Reporting project, Jupyter Notebooks have become the tool of choice for responding to impromptu stakeholder requests for data, troubleshooting and validating data ingestion issues, and data-centric communication both internally and externally. These use cases all rely, to various degrees, on the support the tool provides for rapid development, communicating a technical narrative, and portability.
One area of particular success is pairing with subject matter experts (SMEs) in meetings to interactively explore data. In an almost conversational manner, the technical user can answer the SME's questions with data in real time, working through exploratory data analysis in much the same way as in the demo above. As the session proceeds, discoveries and follow-up questions can be documented in Markdown cells for later review.
JupyterHub has also had a pronounced impact. Thanks to the hard work of Scott and our DevOps team, any advanced analysis in Python or R (or other supported languages) is possible without moving any of the data outside of the VPC. Analyses can be done completely via the web interface, with the Jupyter Notebook reading data via queries into memory of the remote box and writing the resulting output (Jupyter Notebook, visualizations, cleaned data) directly into our remote file management system. At no point is it necessary to download the data locally, which greatly improves the security around sensitive data.
Jupyter Notebook and JupyterHub are part of a powerful ecosystem of open-source, largely language-agnostic tools that can improve development speed and make communicating ideas in a reproducible manner straightforward. These tools improve the way SemanticBits iterates on analyses and communicates with internal and external stakeholders. More important, these tools have an even greater role to play in improving the quality of scientific research in the future. If recent LIGO open-science initiatives are anything to go by, Project Jupyter has a bright future ahead of it.