Just in Time Science
Author: Colin Payne-Rogers
Science is “the systematic study of the structure and behavior of the physical and natural world through observation and experimentation.” Since its emergence during the late Renaissance, scientific progress has been made primarily through the aptly named scientific method. This basic empirical model has generated the modern world. The screen on which you’re reading this article, electric vehicles, central heating, a blockchain toothbrush that will let you mine coins during your daily dental routine: all are enabled by science.
As the scientific method emerged, science was communicated in a handful of ways: private letters, public talks, and full-length books. As scientific societies emerged, communication was made more effective with the advent of the scientific journal, and with it the scientific paper. This innovation has persisted as the primary means of scientific communication for hundreds of years.
However, prose scientific papers have important limitations. They usually provide only high-level summaries of the method and results. This leaves scientists without an adequate mechanism to structure and document their research. It creates challenges for peers within the scientific community who wish to replicate those methods but lack the original data and detailed steps to do so. It has taken the convergence of these frustrated scientists, the ubiquity of computer programming, and the Internet to challenge this model.
Enter Computational Notebooks
The foundation of this challenge rests on three ideas: interactivity, shareability, and readability. Alan Kay, a pioneer of computer science, envisioned a “[notebook] that users could program themselves” as early as the late 1960s. Although his Dynabook concept included hardware, its interactivity isn’t lost on today’s computational notebooks. Richard Stallman, a lifelong advocate for free software (and self-denied father of open source), launched the GNU project in 1983 and has helped keep free, shareable software at the forefront ever since. Last, but certainly not least, it is the ethic of literate programming from the famed Donald Knuth, “master of algorithms” and the “Yoda of Silicon Valley,” that has driven the development of modern computational notebooks.
The idea behind literate programming is simple: written code should be readable. Computational notebooks embody this principle. Like most aspects of information technology, variations have emerged over the years. Mathematica’s Notebook is an early example. Maple’s worksheet feature is another. Both were developed in the spirit of Knuth’s literate programming. Both were introduced as components of proprietary software; both lack the shareability of open source. More recent open source examples include Apache Zeppelin, SageMath, R Markdown, IPython, and Jupyter Notebooks (derived from IPython).
To understand what a computational notebook is, it is helpful to examine an example. A Jupyter Notebook is “an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.” A notebook is stored as a JSON document (the .ipynb file), is rendered as a web page in your browser, and is powered by a kernel from your favorite programming language (Python, right?). It is made of cells. Each cell is either a code cell, with its corresponding output, or a markdown cell. Code cells share the kernel, so computation builds on itself as the author builds the notebook: a variable defined in one cell is available to the cells run after it. The aptly named markdown cells support Markdown text formatting and LaTeX for equation editing.
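The cell structure described above can be sketched in a few lines of Python. This is a minimal illustration and not how Jupyter itself is implemented. The helper names (markdown_cell, code_cell, run_code_cells) are hypothetical, the falling-object example is invented for the demonstration, and the shared dictionary is only a toy stand-in for a real kernel, which runs as a separate process. The dictionary layout follows the version 4 notebook file format as I understand it.

```python
import json


def markdown_cell(text):
    """Build a markdown cell (hypothetical helper, not a Jupyter API)."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    """Build an unexecuted code cell (hypothetical helper, not a Jupyter API)."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


# A tiny notebook: one narrative cell with a LaTeX equation, two code cells.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        markdown_cell("# Falling object\nDistance fallen: $d = \\frac{1}{2} g t^2$"),
        code_cell("g = 9.81  # gravitational acceleration, m/s^2"),
        code_cell("t = 2.0\nd = 0.5 * g * t ** 2"),
    ],
}


def run_code_cells(nb):
    """Run code cells in order against one shared namespace.

    Assumption: this loop only imitates the shared state a kernel provides.
    """
    namespace = {}
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            exec(cell["source"], namespace)
    return namespace


state = run_code_cells(notebook)
serialized = json.dumps(notebook, indent=1)  # roughly what a .ipynb file holds

print(round(state["d"], 2))  # the second code cell reused g from the first
```

The second code cell never defines g; it works only because the first cell already ran in the same namespace. That shared state is what lets a notebook's computation build on itself from cell to cell.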
Okay, but what is a computational notebook? I think of it as a candidate for the modern essay. Rooted in storytelling and born in the tradition of literature, the essay was pioneered by natural philosophers as a short, narrative form of discourse. To borrow from the etymology of the term itself, an essay is meant to put ideas or arguments on trial. It was likely in this spirit that the scientific paper evolved as a means for scientific communication. Now, though, the scientific paper is insufficient for putting modern science on trial. It cannot encapsulate the requisite data, computation, equations, theory, or results. Computational notebooks can.
Peer Review Needs to be Updated
Peer review is a natural complement to the scientific essay. In the paper, an argument is put on trial. During review, peers serve as judge, jury, and, occasionally, executioner. In this manner, the human collective has been making scientific progress for hundreds of years.
The peer review process rests on a couple of assumptions. First, a unit of science can be contained in one work. Second, people with access to that unit of science can understand, critique, or challenge it. The first is challenged as the weight of a scientific argument shifts away from prose and logic and toward data (this is one marker of the broader emergence of data-ism, popularized by the best-selling author Yuval Noah Harari). The second is challenged as our ability to conduct science globalizes more quickly than our ability to share it; rapidly lifting the world out of poverty isn’t the same as opening access to our wealth of accumulated knowledge.
Computational notebooks, in their form as the modern essay, address the first assumption by capturing more of the science they communicate. In their function, by helping authors plug their science into the Internet, they address the second. They have the potential to deepen the review process and encourage broader communication of science.
Publishing Needs to be Updated
Much of science today exists behind a paywall, where access is limited to individuals or institutions who can afford it. For most of the history of the scientific journal, this was understandable. The content of a journal needed to be collected, peer reviewed, collated, and distributed. The readers needed to know about, and have access to, print copies of journals. Scientists and researchers needed to know who to talk to, what problems were being worked on, and how to collaborate with one another. The problem that scientific publishers were solving aligned well with the “calendars, pamphlets, and other ephemera” that Gutenberg’s printing press was effective at disseminating.
But just as the invention of the printing press was a paradigm shift that fueled the Enlightenment and threatened conventional power structures in Europe (hinging on the moment that Martin Luther supposedly nailed his 95 theses on the doors of churches across Wittenberg, Germany), one would have expected the invention of the Internet to fuel a paradigm shift in scientific publishing: a move in the structure of the publishing system away from a directed or acknowledged system-of-systems and toward a collaborative or virtual one.
It hasn’t. Progress has been made with the right technologies, but scientific publishing has been slow to open its doors. If changes aren’t made by the publishers (if science isn’t cracked open the way the Internet has cracked open so many things), computational notebooks give scientists the means to drag science kicking and screaming into the future (and if they don’t, others will).
Replication Needs to be … Installed?
Transparency in research is important. Open access to published data, computation, and results is one side of the story. The reproducibility of scientific literature is the other. There are convincing arguments that many published research findings are false, something that is especially true, and especially poignant, in the social sciences. The incentive structure for scientific publishing (remember those outdated journals) isn’t helping. Researchers struggle to replicate the results of others. Initial findings disproportionately affirm, rather than nullify, hypotheses. Even if these claims are somewhat alarmist, the underlying point stands: science is hard to do.
This problem may seem tangential to the application of computational notebooks to science. It isn’t. Encouraging open access to research data, methods, computation, and findings is an important front in the war against these false, or at the very least biased, or at the very least difficult to find, findings. Doing so in a robust manner is critical. Computational notebooks need to be adopted. Science needs to be opened and researchers need to be encouraged to replicate findings.
Time is of the Essence
It is hard to believe that a single technology, computational notebooks, is poised to tie together so many threads (interactive computation and literate programming; scientific papers, publishing, peer review, and replication; storytelling and essays), but it is. The structure of science needs to be updated. And the sooner it can be done, the better.
Colin Payne-Rogers is an ex-data scientist and current communications engineer with a graduate degree in mechanical engineering and an interest in all things scientific or otherwise.
© 2019 The MITRE Corporation. All rights reserved. Approved for public release. Distribution unlimited. (Case number 19-0305)