Data Science Practitioners
Author: Amanda Andrei
In earlier posts, my colleagues and I discussed defining, applying, and coordinating data science at MITRE, as well as the technical and cultural challenges that data scientists and teams encounter. I had the chance to talk to several practitioners at MITRE and learned whether they see themselves as data scientists (or not), the overarching perspectives they bring to their everyday work, and the concepts that nag at them.
“Do you see yourself as a data scientist?”
I received a range of answers to this question, which speaks to the diversity and complexity of this emerging field.
“You know what, I don’t,” explains Imanuel Portalatin, an engineer from the Department of Data Engineering and Biometrics. “I’m more a multidisciplinary systems engineer really focused on data.” Starting out as a software engineering contractor for the Department of Defense, Portalatin worked increasingly on modeling real-world data for the military, eventually earning a master’s in systems engineering and transitioning to MITRE. “I like to leverage data science techniques,” Portalatin explains. “We’ve proposed different projects, all aimed at: how do we facilitate these super algorithms to come up with cool solutions?”
For another perspective, Tony Donadio from the Department of Data Analytics describes himself as an “aspirational data scientist.” “I’ve been with MITRE for a long time and done many different things,” he explains. “This is a role that I’m growing into by leveraging skills from my previous work.” Donadio has a math and physics background from more than a decade with MITRE’s Quantum Information Science Group. He also has a background in data visualization and has brought those skills to bear on multiple assignments.
And when I asked Mike Shea, an engineer from the Global Operations Division, whether he considered himself a data scientist, he burst out laughing. “I’ve often referred to myself as a champion of Excel pivot tables,” he joked. “Data munger, that’s not so bad. I could live with that. Okay, data wrangler. Kind of has an Old West vibe to it.” When the laughter subsided, Shea explained that while he may use data science techniques, they are usually the simpler ones – which often get good results within time and cost constraints.
Bringing Simplicity and Creativity to Data Science
One of Shea’s projects involved working for a customer who receives tens of thousands of unstructured text documents every day – documents that are not organized in a predefined manner and have irregularities that make them difficult for a computer to parse. Even the metadata – the data that describes the other data, such as tags on a document – was inaccurate. Where to even start?
The team observed that every document had a title with text that could identify the document type, so they began parsing the titles and using basic pattern matching to identify the documents. “This isn’t space age deep neural net stuff,” Shea explains. “These are low-level sorts of things – metadata analysis and basic pattern matching.” As a result, they found that 20 percent of the data was redundant and could be removed. They also found that another 20 percent could likely be excluded from further processing. After the team shared these findings with the sponsors, it turned out that this data no longer needed to be collected at all, which reduced data ingestion by nearly half – a huge savings in resources! “I’ve found that the simplest things we do often result in the best output,” notes Shea. “Just parsing the title had a huge impact for our sponsors.”
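To give a flavor of what "basic pattern matching" on titles can look like, here is a minimal Python sketch. It is not the team's actual code: the document types, the regular expressions, and the sample titles are all hypothetical and invented for illustration, since the post does not describe the real ones.

```python
import re
from collections import Counter

# Hypothetical (type, pattern) pairs; the real document types are not given in the post.
TITLE_PATTERNS = [
    ("status_report", re.compile(r"\bstatus report\b", re.IGNORECASE)),
    ("meeting_notes", re.compile(r"\b(minutes|meeting notes)\b", re.IGNORECASE)),
    ("system_log", re.compile(r"^log[-_ ]", re.IGNORECASE)),
]


def classify_title(title):
    """Return the first document type whose pattern matches the title, else 'unknown'."""
    for doc_type, pattern in TITLE_PATTERNS:
        if pattern.search(title):
            return doc_type
    return "unknown"


def summarize(titles):
    """Count how many titles fall into each document type."""
    return Counter(classify_title(title) for title in titles)


# Made-up sample titles, for illustration only.
titles = [
    "Weekly Status Report - Week 12",
    "LOG_2018-03-01 nightly export",
    "Meeting Notes: planning session",
    "Untitled document",
]
counts = summarize(titles)
for doc_type, count in counts.items():
    print(doc_type, count)
```

A per-type tally like this is one simple way to see which categories dominate a daily feed, and so which ones might be candidates for removal or for skipping later processing.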
Similarly, Portalatin stresses that users need a simple yet deep understanding of the data. Some of his current projects involve automating tools in the cloud – the software and services running on the Internet – and developing natural language processing tools to help analysts. “There’s a danger in data science: we can come up with all these visualizations or statistical analyses, but if they don’t make sense to the user or help them in their mission, then what use are they? It’s tricky because you want to highlight all the complexity, but at the same time, users have to know this stuff isn’t magic.”
Donadio also acknowledges the difficulty of visualizing complex data and suggests bringing artistic creativity to the scientific creativity of data science. “The best software engineering and analysis is the result of a creative process,” Donadio relates. “It’s not just numbers and algorithms. Visualization is often about thinking, and trying to get people to see things, in a new and sometimes more intuitive way.” This combination of art and science came into play in one of Donadio’s recent projects, which combined gaming and data collection to improve quality of life for children with cerebral palsy – which we’ll dive into more in an upcoming post.
In the next series of posts focused on applications in data science, we’ll take a look at Donadio’s project, as well as chat with Portalatin about his research interests in anti-fragility, and with Shea about his questions on relevance and anomalies.
Amanda Andrei is a computational social scientist in the Department of Cognitive Sciences and Artificial Intelligence. She specializes in social media analysis, designing innovative spaces, and writing articles on cool subjects.
© 2018 The MITRE Corporation. All rights reserved. Approved for public release; distribution unlimited
The MITRE Corporation is a not-for-profit organization that operates research and development centers sponsored by the federal government. Learn more about MITRE.