Defining, Applying, and Coordinating Data Science at MITRE
Author: Amanda L. Andrei
“For a long time I have thought I was a statistician, interested in inferences from the particular to the general,” wrote mathematician John Tukey in his 1962 article, “The Future of Data Analysis.” “But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt.”
Tukey went on to propose and define this emerging area of science: “All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”
Fifty years and a data revolution later, Harvard Business Review published an article by Thomas Davenport and D.J. Patil, “Data Scientist: The Sexiest Job of the 21st Century,” popularizing the idea of a new occupation: the data scientist, “a high-ranking professional with the training and curiosity to make discoveries in the world of big data.” The perfect data scientist, Davenport and Patil suggest, is an individual with deep knowledge of math and statistics, programming, and a domain such as healthcare or cybersecurity. Some engineers jokingly call this individual a unicorn.
“There are many disciplines in data science,” explains Dr. Elizabeth Hohman, statistician and group leader within MITRE’s Department of Data Analytics. “The expected similarity between two data scientists is less than the expected similarity between two statisticians, and statistics is a wide field.”
Data Science: Statistics, Science, or Something Else?
Short answer: yes, yes, and yes. Statistics and data science follow a process of collecting data, processing and cleaning it, modeling and analyzing it, visualizing it, and using the output to make decisions. However, statistics often focuses on answering a certain question and informing its answer with data. In contrast, data science adds in domain expertise and focuses on learning what the data says about the world.
Data science earns the science part of its label by making inferences about the world from data, considering what new data to collect and how best to collect it, and ensuring reproducibility and validation. All these features are essential to the scientific method. Although they are essential to statistics as well, data science emerged from statistics out of the need to address large and changing datasets.
And it’s the type of data that makes data science extra interesting. Often datasets are large, complex, and non-stationary. “I am a data scientist in the sense that I work on projects that make inferences from large datasets,” relates Hohman. “This is fun when the data is complex and unwieldy, and especially when you consider different types of data associated with entities such as images or texts or videos. You are trying to get each entity represented mathematically, and then making inferences is fun and difficult.”
MITRE Applying Data Science to Aviation Safety
One notable area where MITRE tackles these kinds of datasets is through the public-private partnership Aviation Safety Information Analysis and Sharing (ASIAS), a collaborative effort between the Federal Aviation Administration and industry, with MITRE as the trusted third party. For over ten years, MITRE has served as the steward and architect for the system, which is the world’s largest repository of aviation safety information. Membership is voluntary, and there are over a hundred members from commercial air carriers, general aviation operators, flight training, and government. Data is anonymized and can include anything from a plane’s speed to reports about a flight. “It’s more data than you can imagine,” says Hohman.
Working on ASIAS in aviation safety, Dr. Eric Bloedorn, a MITRE senior principal artificial intelligence engineer, describes the process thus: “There’s a stream of data coming in and we want to identify emerging trends and vulnerabilities we should pay attention to. We have a variety of tools to see what’s spiking or dropping, or something that we’ve never seen before. How do we identify the pattern from this data and rigorously characterize it?”
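As a rough illustration of what "spiking," "dropping," and "never seen before" can mean in practice, here is a minimal Python sketch. It is not one of the ASIAS tools and makes no claim about how MITRE does this: the function name `flag_changes`, the per-period counts, and the z-score threshold of 3.0 are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def flag_changes(history, current, z_threshold=3.0):
    """Label each category in `current` as 'new', 'spiking', 'dropping', or 'steady'.

    history: dict mapping category -> list of past per-period counts (hypothetical data)
    current: dict mapping category -> count in the latest period
    z_threshold: how many standard deviations count as unusual (an assumed cutoff)
    """
    labels = {}
    for category, count in current.items():
        past = history.get(category)
        if not past:
            # No history at all: something we have never seen before.
            labels[category] = "new"
            continue
        if len(past) < 2:
            # Too little history to estimate variability.
            labels[category] = "steady"
            continue
        mu, sigma = mean(past), stdev(past)
        if sigma == 0:
            # Perfectly flat history: any departure from it is notable.
            if count > mu:
                labels[category] = "spiking"
            elif count < mu:
                labels[category] = "dropping"
            else:
                labels[category] = "steady"
            continue
        z = (count - mu) / sigma
        if z >= z_threshold:
            labels[category] = "spiking"
        elif z <= -z_threshold:
            labels[category] = "dropping"
        else:
            labels[category] = "steady"
    return labels
```

A simple threshold like this only surfaces candidates. Deciding whether a flagged pattern matters, and characterizing it rigorously, still depends on people who know the domain.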
Not only does MITRE protect and store the data, but staff have evaluated multiple third-party tools, as well as developed tools, to assist with analysis, modeling, and visualization. Collaboration across MITRE, the FAA, and industry has been key for enhancing aviation safety.
The Need for Teams
Despite Davenport and Patil’s emphasis on the data scientist as an individual role, sponsors and organizations are learning that it takes a team to tackle these complex problems.
“The challenge is having a relatively broad set of skills,” says Bloedorn. With one of his projects in cybersecurity, he notes: “The only way that we were successful in doing some of this malware analysis work was that we had guys in the group who were reverse engineering experts. These domain experts are necessary to understand the meaning of different API calls. They are necessary to suggest which are the features they would use to solve the task at hand. I need really long discussions with those domain experts—it’s a team that includes skills in analytics, math, computer science, and that domain [of interest].”
Sponsors and agencies asked how to build such teams so often that Bloedorn and Bernard McShea, a lead cyber security engineer, put together a presentation on best practices for forming analytic teams. They have found that members of a good team communicate frequently and in depth with one another, and that the team has at least the following three roles:
- Domain subject matter expert – a person who understands the data and the business need
- Hybrid data scientist and domain subject matter expert – a person who understands both modeling methods and the domain, and who can implement the model
- Modeling subject matter expert – a person who understands the big picture and the current state of machine learning or modeling in both academia and industry
Bloedorn likens a data science team to a football team, with tight, complementary coordination and each member being aware of each other and the dynamic changes of the game: “You can’t say for the next six plays, I’m going to block this one player —it has to be dependent on the play. Maybe you need to run right, so you need to make sure the running back can get to the right. It’s very tight coordination. [By extension], if you give data to a guy who doesn’t know the domain, and he just cleans the data, he might clean the signal right out of it. You need some deep expertise in the domain and very tight coordination with the expert.”
What’s Down the Road?
IBM reports that every day we generate 2.5 quintillion bytes of data. (That’s 2.5 million terabytes, or 2,500,000,000,000,000,000 bytes of data.) With the rate of data usage accelerating, we can agree with John Tukey that our central interest is certainly data analysis.
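The conversion in parentheses can be checked in a few lines of Python. This sketch assumes "terabyte" means the decimal unit of 10**12 bytes; with the binary tebibyte (2**40 bytes) the count comes out somewhat lower. The variable names are illustrative only.

```python
# Assumption: a terabyte here is the decimal unit (10**12 bytes),
# not the binary tebibyte (2**40 bytes).
BYTES_PER_DAY = 2_500_000_000_000_000_000  # 2.5 quintillion, the figure quoted above
BYTES_PER_TERABYTE = 10**12

terabytes_per_day = BYTES_PER_DAY // BYTES_PER_TERABYTE
tebibytes_per_day = BYTES_PER_DAY / 2**40

print(f"{BYTES_PER_DAY:,} bytes")
print(f"{terabytes_per_day:,} decimal terabytes")
print(f"{tebibytes_per_day:,.0f} binary tebibytes")
```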
“We’re in this new world of data,” says Hohman, “that’s the evolution, or the thing that’s changing. The validity and reproducibility we assign to things is important. Do we have the evidence to expect, ‘this pattern will happen in the future?’ We learn a pattern or build a model from data, and we need to be testing whether the process generating the data has changed and whether the conclusions we draw remain valid.”
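One simple way to ask whether the process generating the data has changed is a two-sample permutation test, sketched below in Python. This is an illustration under stated assumptions, not a method attributed to Hohman or MITRE: it compares only the means of an older reference sample and a recent sample, and the function name `permutation_p_value`, the number of permutations, and the fixed seed are all choices made for the example.

```python
import random
from statistics import mean

def permutation_p_value(reference, recent, n_permutations=2000, seed=0):
    """Estimate how often random relabeling yields a gap in means
    at least as large as the one observed between the two samples.

    A small value suggests the recent data no longer looks like the
    reference data, at least as far as the mean is concerned.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(recent) - mean(reference))
    pooled = list(reference) + list(recent)
    n_ref = len(reference)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[n_ref:]) - mean(pooled[:n_ref]))
        if gap >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)
```

A model trained on the reference period could be rechecked whenever this value drops below a threshold the team has agreed on. A shift in variance or shape with an unchanged mean would go unnoticed by this particular check, so a real monitoring setup would likely compare more than one statistic.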
Bloedorn also notes that agencies and organizations are interested in artificial intelligence, machine learning, and data science, but he cautions against overselling or hyping the technologies. “It’s exciting and scary at the same time: you’re talking to the director or deputy directors about how this entire organization does its business based on machine learning and AI, and that is exciting. I think there’s real value, but how do we go in there and actually deliver impact, without disappointing them? If you promise too much and then fail to deliver on that promise, the negative impact could be significant. The wave of interest is cresting, and we’re trying to keep it from crashing too hard.” Such a crash is often called an AI winter: a period when, according to Jackie Fenn and Mark Raskino in Mastering the Hype Cycle, “the whole technology [falls] out of favor because it failed to live up to people’s initial, overheated expectations,” leading to reduced funding and interest in the technologies.
Could there be a data science winter? More likely, the term and field of data science will shift and change. Whether the work is labeled big data analysis, data analytics, or the intersection of statistics and machine learning, teams will continue to practice the science of learning from data. Many challenges remain for data scientists and data science teams, from coordinating the proper experts to simply getting into a data science mindset. Learn about these issues in the next post, Challenges in Data Science – Technical and Cultural.
Amanda Andrei is a computational social scientist in the Department of Cognitive Sciences and Artificial Intelligence at The MITRE Corporation. She specializes in social media analysis, designing innovative spaces, and writing articles on cool subjects.
© 2017 The MITRE Corporation. All rights reserved. Approved for public release; distribution unlimited. Case Number 17-4148