In February 2016, the Laser Interferometer Gravitational-Wave Observatory (LIGO) publicly announced a landmark finding: for the first time ever, scientists had directly detected gravitational waves, confirming a major prediction of Albert Einstein’s 100-year-old general theory of relativity. In the shadow of this watershed moment was a more terrestrial but nonetheless important concern within the realm of computing: data from the discovery had been published online within a day, allowing the general public to pore over the analyses in the form of a Jupyter Notebook.
Jupyter is an open-source platform that allows users to create interactive “notebooks” that combine programming code, code outputs, visuals, and narrative text in a single, shareable document. Jupyter and similar technologies have been catching on in a big way, to the extent that The Atlantic recently implicated them in the death of the traditional scientific paper, and Nature proclaimed Jupyter to be the data scientist’s computational notebook of choice.
Jupyter also has a huge presence in universities, and is catching on enough in the corporate world to grab the attention of Forbes. You can even sit in your living room and re-create from source data the sound of two black holes colliding, 1.3 billion years in the making and captured by one of the most sensitive instruments ever designed!
Here at SemanticBits, Jupyter has been making waves of a different sort. As a DevOps Engineer on a team working with the U.S. Department of Health and Human Services (HHS) to provide analytics and reporting for the Medicare and Medicaid programs, I’ve had the pleasure of adding Jupyter as a part of this platform over the past several months. On a technical level, our goal is to unify complex and diverse data sets into one comprehensive platform, and Jupyter is just the tool for the job.
It’s an exciting time to be working with HHS, as the agency positions itself to be a leader in government use of analytics and data science. Over the past couple of years, we’ve seen HHS leverage data analytics to prosecute the largest healthcare fraud enforcement action in American history; data is likewise used to identify opioid fraudsters and direct treatment efforts; and my team’s particular project, the Quality Payment Program, depends on accurate and timely data to provide a modernized payment system and improve care delivery and coordination.
As more and more aspects of society and governance become data-driven, the need for transparency, reproducibility, and collaboration becomes ever more critical. Jupyter Notebooks are a great tool not only for “showing your work” but for experimenting with changes in real-time and allowing others to do the same.
“Jupyter makes everything so much easier,” says SemanticBits Data Scientist Paul Garaud. “And, as a result, it reduces the barriers to best practices, like writing up our findings in a way that’s presentable and collaborative.” Paul was one of our first Jupyter users, and we’ll be hearing more about Jupyter from his data science perspective in a future blog post.
I’m excited to learn more about all the different ways our staff and partners are using Jupyter. Before we even got to this point, though, our DevOps practice was put fully through its paces in standing up the environment, customizing it to meet our needs, and codifying all of this into automated and repeatable builds.
Dan Geer, Security Chief of CIA venture-capital firm In-Q-Tel, has updated an old engineering maxim for the modern age: Freedom. Security. Convenience. Choose two. A lot of the work of our DevOps team could probably be expressed as a never-ending optimization of that trade-off, and, in conjunction with Kubernetes and our own customizations, Jupyter may even have the potential to give us the hat trick.
But first things first: how do we secure this thing? Healthcare data breaches in particular are on the rise, so of course security is paramount to everything we do. Jupyter provides users with a freeform programming and shell environment running within our VPC, with direct access to sensitive databases and other resources, creating an interesting security-review scenario: behavior that would normally be looked for as a vulnerability—the ability to execute arbitrary code—is not only a feature but a major part of the software’s whole raison d’être!
Our Jupyter deployment uses JupyterHub, a multi-user version of Jupyter we’re running on Kubernetes to provide each user with their own isolated environment, with private persistent storage, right in our VPC. Following and expanding upon the excellent Zero to JupyterHub with Kubernetes guide, we combined a number of factors to ensure that the security of this setup meets the requirements for dealing with protected information:
- Private, in-house Kubernetes cluster built from the ground up to comply with project security standards
- Kubernetes security hardening
- Extensive infrastructure and instance-level security hardening
- Comprehensive penetration testing
- Patching JupyterHub source code to secure all data in motion
- At-rest encryption on all Kubernetes and Jupyter volumes, including user storage
- Network-overlay encryption
- Granular, multi-factor access policies for object-level storage
- Kubernetes network policies used in conjunction with proxies to allow Notebooks to connect only to authorized databases
- Automated database account management to ensure permissions are up-to-date and that users have exactly the access they need
- LDAP integration and whitelisting to ensure only authorized users can access the system
- Custom user libraries promoting best practices for password management and database connectivity
We use Ansible to orchestrate all of this, ensuring that our configurations are consistent across all environments and easily updated. With the help of our dedicated security engineers, this allows us to respond quickly to any new vulnerabilities that are found.
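To make the network-policy item in the list above more concrete, here is a minimal sketch of how an egress rule for Notebook pods might be generated. It is an illustration, not our actual configuration: the policy name, namespace, CIDR range, port, and the `component: singleuser-server` label are all assumptions, and the selector must match whatever labels your own Notebook pods carry.

```python
import json


def notebook_egress_policy(name, namespace, db_cidrs, db_port):
    """Build a NetworkPolicy manifest limiting Notebook-pod egress to the given database CIDRs."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumed label; adjust to match the labels on your Notebook pods.
            "podSelector": {"matchLabels": {"component": "singleuser-server"}},
            # Once a pod is selected by an Egress policy, only listed destinations are allowed.
            "policyTypes": ["Egress"],
            "egress": [
                {
                    "to": [{"ipBlock": {"cidr": cidr}} for cidr in db_cidrs],
                    "ports": [{"protocol": "TCP", "port": db_port}],
                }
            ],
        },
    }


# Hypothetical values, for illustration only.
policy = notebook_egress_policy("notebook-db-egress", "jupyter", ["10.0.12.0/24"], 5432)
print(json.dumps(policy, indent=2))
```

The printed JSON can be fed to `kubectl apply -f -`. A working policy would also need rules that let Notebooks reach DNS and the hub itself, which this sketch leaves out for brevity.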
Running Jupyter on Kubernetes also presents intriguing challenges and possibilities with regard to resource provisioning. How can we strike the best balance between giving Jupyter users the memory and horsepower they need to do their work and keeping cloud expenses to a minimum? The answer is auto-scaling, of course. Jupyter, though, throws an interesting twist into the mix.
Unlike a lot of the auto-scaling we’re doing on other applications, Jupyter is not amenable to the usual metric-based paradigms, in which servers are added or removed at will to maintain a certain load. When a user fires up a Jupyter Notebook, it’s really more like a pet than cattle. Though it’s true that a user’s Notebook pod can be torn down and restored again quickly, doing so in the middle of a session could interrupt a running computation, so we must treat any node running at least one Jupyter Notebook as sacrosanct.
Thankfully, the Kubernetes team already has a solution available for this: the Kubernetes Cluster Autoscaler. The cluster autoscaler (CA) simulates the real Kubernetes scheduler, taking into account factors such as pod resource requirements and evictability, which allows a Kubernetes cluster to scale in a more intelligent, pod-aware way than you would get with standard cloud-provider autoscaling. By setting up the CA and adjusting Jupyter’s pod disruption budgets, we’ve been able to ensure that our infrastructure will scale out only when a new instance is needed to accommodate more Jupyter Notebooks; will scale in when instances are either unused or running workloads that can be consolidated onto other instances; and, perhaps most important, will refrain from scaling in if doing so would interrupt someone’s Jupyter Notebook.
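The scale-in rule described above can be sketched as a toy model. This is not the Cluster Autoscaler’s actual algorithm, which also weighs memory, affinity, disruption budgets, and much more. It only tracks CPU requests, and the `Pod`, `Node`, and `can_scale_in` names are hypothetical; the `evictable` flag stands in for whatever mechanism marks a running Notebook as off-limits.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    cpu: float               # requested CPU cores
    evictable: bool = True   # False for a running Notebook we refuse to interrupt


@dataclass
class Node:
    name: str
    cpu_capacity: float
    pods: list = field(default_factory=list)

    def free_cpu(self):
        return self.cpu_capacity - sum(p.cpu for p in self.pods)


def can_scale_in(node, others):
    """Return True if `node` could be removed without interrupting a Notebook."""
    # Any non-evictable pod makes the whole node untouchable.
    if any(not p.evictable for p in node.pods):
        return False
    # Try a first-fit placement of this node's pods onto the remaining nodes.
    free = {n.name: n.free_cpu() for n in others}
    for pod in sorted(node.pods, key=lambda p: p.cpu, reverse=True):
        target = next((name for name, cpu in free.items() if cpu >= pod.cpu), None)
        if target is None:
            return False
        free[target] -= pod.cpu
    return True


node_a = Node("node-a", 4.0, [Pod("notebook-alice", 2.0, evictable=False)])
node_b = Node("node-b", 4.0, [Pod("proxy", 0.5)])
node_c = Node("node-c", 4.0)

for node in (node_a, node_b, node_c):
    others = [n for n in (node_a, node_b, node_c) if n is not node]
    print(node.name, can_scale_in(node, others))
```

In this example the node hosting a Notebook is never a candidate for removal, while the node with only a small evictable pod and the empty node both are.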
As our Jupyter service continues to evolve, we’ll be looking at ways to provide vertical flexibility in addition to the horizontal scaling we get with the CA. One limitation with our present setup is that there is no vertical scaling available, or even really possible in a traditional sense; a user’s Jupyter Notebook session is contained in one Kubernetes pod, and therefore must live on a single, uninterrupted physical machine. This can present some challenges when serving a team of data scientists working with large data sets. What if a user needs more memory or CPU than is available on a single instance?
One near-term enhancement we’re working on is to add a Spark kernel to our Jupyter environment, allowing some work to be offloaded to our Hadoop cluster. We’re also looking at allowing Jupyter users to request specific resource levels at login time, so extra-heavy workloads can be supported economically. Finally, we’re looking forward to using the new pod priority and preemption features of Kubernetes, which will allow the CA to spin up new hardware instances in advance of demand, eliminating wait time to start a new Notebook when the cluster is busy.
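As a rough idea of what login-time resource selection could look like, here is a sketch assuming JupyterHub’s KubeSpawner and its `profile_list` option. The tier names and sizes are made up for illustration, and `default_profile` is a hypothetical helper for sanity-checking the list, not part of KubeSpawner.

```python
# Hypothetical resource tiers; names and sizes are illustrative only.
PROFILES = [
    {
        "display_name": "Standard",
        "description": "2 CPU, 8 GB RAM",
        "default": True,
        "kubespawner_override": {"cpu_limit": 2, "mem_limit": "8G"},
    },
    {
        "display_name": "Large",
        "description": "8 CPU, 64 GB RAM",
        "kubespawner_override": {"cpu_limit": 8, "mem_limit": "64G"},
    },
]


def default_profile(profiles):
    """Return the profile marked as default, falling back to the first entry."""
    defaults = [p for p in profiles if p.get("default")]
    if len(defaults) > 1:
        raise ValueError("at most one profile should be marked as default")
    return defaults[0] if defaults else profiles[0]


# In jupyterhub_config.py, the list would be handed to the spawner like so:
#   c.KubeSpawner.profile_list = PROFILES
print(default_profile(PROFILES)["display_name"])
```

With a list like this in place, users would pick a tier from a form when starting their server, and heavier tiers would only consume larger instances while someone is actually using them.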
Not unlike the LIGO detectors themselves, we’re still growing towards full maturity, and are working hard to sustain our initial success. Our users are thrilled with how Jupyter makes it easy to step through logic, uncover assumptions, explore what-if scenarios, and generally peel back the onion of a solution or result set. Jupyter’s versatility continues to impress us. And whether your corner of the universe is measured in parsecs or people, any tool that can bring a little more understanding to the table is worth looking into.