Technical Challenges in Data Science
As Amanda Andrei mentioned in her previous post, Defining, Applying, and Coordinating Data Science at MITRE, we are generating 2.5 million terabytes of data a day, and the need for data science teams and individual contributors is crucial for moving what we find up the spectrum to knowledge that we might usefully share. Technical and cultural challenges, too, arise in our attempts to understand and harness this new resource. Andrei continues her discussions with experts Dr. Elizabeth Hohman, statistician and group leader within MITRE’s Department of Data Analytics, and Dr. Eric Bloedorn, senior principal artificial intelligence engineer.—Editor
Public domain photo courtesy of Pixabay
Author: Amanda L. Andrei
Many technical challenges arise when doing data science. Here are some that MITRE has experienced in the work we’ve done in this area.
Dealing with high-dimensional space (aka, dealing with data that has many, many features)
According to the Oxford Dictionary of Statistics, the term “curse of dimensionality” was coined by mathematician Richard Bellman in 1961 to describe “the difficulty of obtaining accurate estimates when there are many parameters to be estimated simultaneously.”
Dimension is simply another word for features, which are traits or characteristics. For instance, features of a car could be weight, color, miles per gallon, number of seats in the car, the amount of change in the glove box, whether or not the car has floor mats, and so on. As our access to data has increased, the features that are collected have increased, too.
Our instinct may be to say, “The more data, the better!” But in reality, the more of these features that you have, the more you have to address several problems, namely, spurious correlations and the need for exponentially larger sample sizes.
Spurious correlations. Let’s take our car example. Say we want to build a model that predicts the make of a car: whether it is a Toyota or a Ford. We record the features mentioned above, as well as thousands of others. The more features we collect, the higher the chance that we’ll find a spurious correlation between a feature and the make. That is, we increase the chances that the model will learn idiosyncrasies in the training data that don’t generalize to the rest of the world. The amount of dirt on the car might correlate with the make in our sample, but only through coincidence or a confounding factor, in which case the correlation between dirt and make is spurious. Trouble then arises when we deploy the model in the real world: it won’t work as well because it has learned the specifics of the sample of data it was trained on.
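A small simulation can make this concrete. The sketch below is illustrative only and is not drawn from MITRE's work: it assumes 50 hypothetical cars whose make is assigned by coin flip and whose features are pure random noise, so any correlation it finds is spurious by construction. The names pearson, make, and strongest are invented for the example.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(0)
n_cars = 50

# Assumption: the make is a coin flip (1 = Toyota, 0 = Ford),
# so no feature can truly predict it.
make = [rng.randint(0, 1) for _ in range(n_cars)]

# Generate 1,000 noise features and record how strongly each one
# happens to correlate with the make in this particular sample.
abs_r = []
for _ in range(1000):
    noise_feature = [rng.gauss(0.0, 1.0) for _ in range(n_cars)]
    abs_r.append(abs(pearson(noise_feature, make)))

# Strongest chance correlation among the first k features.
strongest = {k: max(abs_r[:k]) for k in (1, 10, 100, 1000)}
for k, r in strongest.items():
    print(f"{k:>4} noise features: strongest |r| = {r:.2f}")
```

Because each larger set of features contains the smaller ones, the strongest correlation found can only stay the same or grow as features are added, even though none of the features carries any real information about the make.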
Needing exponentially larger sample sizes. Heads up: this concept is very difficult to visualize and is best explained mathematically.
Let’s stick with our car example. Suppose we collect data from many cars on the road, and we want to determine whether a car is a Toyota, a Ford, or another make. Suppose we use only one feature, say, miles per gallon (MPG). This could then be plotted in 1-dimensional space, i.e., along a line. However, plotting cars by MPG alone may not separate the Toyotas from the Fords. If we add a second feature, say, weight, we get a 2-dimensional plot, i.e., a plane, in which we may be able to separate some of the Toyotas from the Fords. Another feature such as horsepower leads to a 3-dimensional plot, i.e., a cube. There is now even more space for the cars to take up and more room to separate the cars by their make.
As features are added, the space grows, but then there are more ways for a model to overfit the data, homing in on the specificities and idiosyncrasies in the training data. If we imagine the data as a cloud within the space, Hohman says, “There’s more room outside of the cloud of data for an anomaly to occur. When you move to high dimensions, all the data is anomalous because the space is so sparse.”
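The sparsity Hohman describes can be sketched numerically. The example below rests on assumptions that are not in the original discussion: the cars are hypothetical points drawn uniformly at random from a unit hypercube, and "10 per axis" is an arbitrary grid resolution. It keeps the number of cars fixed at 200 and reports how far each car is, on average, from its nearest neighbor as features are added, alongside the number of cars a grid with 10 points per axis would need.

```python
import math
import random

def mean_nearest_neighbor_distance(n_points, n_dims, rng):
    """Average distance from each point to its closest neighbor in the unit hypercube."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points

def samples_for_same_density(points_per_axis, n_dims):
    """Grid points needed to keep the same resolution along every axis."""
    return points_per_axis ** n_dims

rng = random.Random(1)
n_cars = 200  # fixed sample size; only the number of features changes

nn_distance = {}
for d in (1, 2, 3, 10, 50):
    nn_distance[d] = mean_nearest_neighbor_distance(n_cars, d, rng)
    needed = samples_for_same_density(10, d)
    print(f"{d:>2} features: mean nearest-neighbor distance = {nn_distance[d]:.3f}, "
          f"cars needed for 10 per axis = {needed:.0e}")
```

With the sample size held constant, neighbors drift apart as dimensions are added, while the sample size needed to keep the same coverage grows as a power of the number of features.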
The large number of variables in a dataset should remind data science teams that, even with a large amount of data, they need to pay attention to how the data were sampled, whether or not they collected the data themselves.
Training models on the right data. Because data is so varied, sometimes it is tempting to substitute a different kind of data from what is called for, or go after “easy” datasets on which to train models. Bloedorn relates an example of a cybersecurity project during which researchers studying malware used a training dataset of executable files that were already explicitly labeled as malware—as opposed to using a dataset with files that might be harder to classify as malware. The model seemed to perform well, but ran into problems. “When you field it in the real world, it doesn’t work very well,” Bloedorn states. “You need a training set that includes not just examples at the extremes, but right in the middle.”
Hohman adds, “You want to measure a response to input B, and you don’t have a response A, so you say, I’ll substitute a response C to that, and you’re replacing this thing you can’t measure with something you can, but you’re not answering the question you originally set out to answer.” What that means for those of us who need the data is that researchers may get fast results, but they may not be valid. Evaluating these models—and using good data in the first place—is crucial to conducting data science properly.
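Bloedorn's point about needing examples "right in the middle" can be illustrated with a toy model. Everything in this sketch is hypothetical: each file is reduced to a single made-up score between 0 and 1, files scoring above an assumed TRUE_BOUNDARY of 0.6 are malware, and the classifier is a simple one-feature threshold. It is not the method used in the project he describes.

```python
import random

TRUE_BOUNDARY = 0.6  # hypothetical: files scoring above this are malware

def is_malware(score):
    """Ground-truth label under the assumed boundary."""
    return score > TRUE_BOUNDARY

def fit_threshold(scores):
    """Choose the cut point with the fewest training errors (a one-feature decision stump)."""
    ordered = sorted(scores)
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]

    def training_errors(threshold):
        return sum((s > threshold) != is_malware(s) for s in scores)

    return min(candidates, key=training_errors)

def accuracy(threshold, scores):
    """Fraction of files the threshold classifies correctly."""
    return sum((s > threshold) == is_malware(s) for s in scores) / len(scores)

rng = random.Random(2)
full_range = [rng.random() for _ in range(400)]
# "Easy" training set: keep only the obviously benign and obviously malicious files.
extremes_only = [s for s in full_range if s < 0.2 or s > 0.8]
# Files encountered after deployment cover the whole range.
field_data = [rng.random() for _ in range(2000)]

t_easy = fit_threshold(extremes_only)
t_full = fit_threshold(full_range)

print(f"extremes only: threshold {t_easy:.2f}, "
      f"training accuracy {accuracy(t_easy, extremes_only):.2f}, "
      f"field accuracy {accuracy(t_easy, field_data):.2f}")
print(f"full range:    threshold {t_full:.2f}, "
      f"training accuracy {accuracy(t_full, full_range):.2f}, "
      f"field accuracy {accuracy(t_full, field_data):.2f}")
```

The model trained on the extremes scores perfectly on its own training set, because any cut point in the wide gap between the two clusters separates them. Only the hard-to-classify files in the middle reveal where the boundary actually lies.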
Updating models constantly. MITRE has been the steward of the Aviation Safety Information Analysis and Sharing system (ASIAS) for more than 10 years, and aviation safety has evolved over that time, not only in terms of technology, but also in terms of vocabulary and how engineers and decision makers describe the problems at hand. When this happens, older models might start missing current issues. In other domains, such as fraud or cybersecurity, this issue is further exacerbated by the activities of adversaries. “There’s only dawning understanding of this problem of I’ll just put out a model once and be done,” Bloedorn notes. “It’s, I have to put a model out periodically and keep evaluating it. I need to keep track of concept drift.” The problem of concept drift is not always recognized, but it is one that we have addressed in these domains because model maintenance is vital for production analytics.
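One simple way to "keep evaluating" a fielded model is to track its accuracy over a sliding window of recent cases and raise a flag when it falls well below the accuracy measured at deployment. The sketch below is a minimal illustration under stated assumptions, not a description of how ASIAS or any MITRE system is monitored: DriftMonitor, window_size, and tolerance are hypothetical names, and the stream of predictions is synthetic, with the model's error rate made to jump at case 200.

```python
from collections import deque

class DriftMonitor:
    """Flags possible concept drift when recent accuracy falls below a baseline."""

    def __init__(self, baseline_accuracy, window_size=50, tolerance=0.10):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.window = deque(maxlen=window_size)  # keeps only the most recent outcomes

    def record(self, prediction, actual):
        """Store whether the latest prediction was right; return True if drift is suspected."""
        self.window.append(prediction == actual)
        if len(self.window) < self.window.maxlen:
            return False  # not enough recent evidence yet
        recent_accuracy = sum(self.window) / len(self.window)
        return recent_accuracy < self.baseline - self.tolerance

# Synthetic stream: every case is truly positive (1). Before case 200 the
# model misses 1 case in 10; afterwards, as vocabulary shifts, it misses 4 in 10.
DRIFT_START = 200
monitor = DriftMonitor(baseline_accuracy=0.9)
first_alert = None
for i in range(400):
    misses = 1 if i < DRIFT_START else 4
    prediction = 0 if i % 10 < misses else 1
    if monitor.record(prediction, actual=1) and first_alert is None:
        first_alert = i

print(f"drift began at case {DRIFT_START}; first alert at case {first_alert}")
```

The window size and tolerance trade off detection speed against false alarms. In practice the true labels often arrive late, so the comparison would run on whatever confirmed outcomes are available.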
Watch for Cultural Challenges in Data Science, coming in January 2018
Amanda Andrei is a computational social scientist in the Department of Cognitive Sciences and Artificial Intelligence. She specializes in social media analysis, designing innovative spaces, and writing articles on cool subjects.
© 2017 The MITRE Corporation. All rights reserved. Approved for public release.
Distribution unlimited. Case number 17-4601
The MITRE Corporation is a not-for-profit organization that operates research and development centers sponsored by the federal government. Learn more about MITRE.