Interview with Ali Zaidi on designing lessons in artificial intelligence

Cameron Boozarjomehri (left) interviewing Ali Zaidi (right). Photo: Martin Buitrago

Interviewer: Cameron Boozarjomehri

Welcome to the latest installment of the Knowledge-Driven Podcast. In this series, Software Systems Engineer Cameron Boozarjomehri interviews technical leaders at MITRE who have made knowledge sharing and collaboration an integral part of their practice.

Ali Zaidi is a MITRE data scientist tackling an interesting challenge for MITRE as part of his work for Generation AI Nexus. As the fields of machine learning and data science have grown, machine learning education has become a necessity in fields few would associate with computer science. The challenge is this: how do you train college students in history, business, public health, music, and so forth to use tools pioneered by computer scientists? That is the exact question Ali was tasked with solving, and in this discussion, he’ll guide us from informal idea to polished lesson!

Podcast Transcript
Cameron:	00:15	Hello everyone, and welcome to MITRE’s Knowledge-Driven Podcast. I am your host, Cameron Boozarjomehri, and today I am joined by data scientist, Ali Zaidi. Did I pronounce that right? Awesome.
Cameron:	00:25	Unfortunately, our listeners cannot hear when you do a thumbs up. Ali works in MITRE’s data analytics department, is that correct?
Ali:	00:32	Yes, that’s correct.
Cameron:	00:34	So I thought it would just be fun to do a little background on you. Maybe you’d like to introduce yourself and how long you’ve been at MITRE, what kind of things you’ve been doing since you got here?
Ali:	00:42	Sure. I’ll begin by stating that my one-year anniversary at MITRE was just yesterday. And so, I’ve officially been at MITRE for one year. I am a data scientist in our data analytics department here. I recently graduated with my Master’s in Data Science from UVA, and before that, I finished my undergraduate at George Mason University. Yeah, that’s my background.
Cameron:	01:06	One year at MITRE is both a significant and insignificant milestone considering we have some people of very long tenure, but it seems like there’s a lot that you can do in a year. So like, getting to the end of the year and getting to take a minute to look back on all the work you’ve gotten to do since then is pretty exciting.
Cameron:	01:23	As I understand it, you’ve gotten to work on some pretty interesting projects?
Ali:	01:27	Yes. Yeah, I completely agree with you, one year is both a very long amount, yet a very short amount when you think about how quickly it goes by when you’re working on very interesting work.
Ali:	01:38	Yeah, I’ve had the pleasure of working on multiple projects here at MITRE since I’ve started and all of them have been super interesting, and super challenging so far, and I’m excited to get into them with you.
Cameron:	01:49	I’ll start with this, we’ve had a lot of conversations on this podcast about the Symphony platform and Generation AI Nexus. As I understand, you are a member of our Generation AI Nexus team. I think that was a big part of what I wanted to talk to you about today.
Cameron:	02:07	We hear a lot about it as a platform, a place where people who have experience or want to gain experience in machine learning, or share machine learning with the masses, train the next generation of data scientists and, I guess, data engineers. But we don’t really get a good sense of what does a lesson, what does an actual thing that you learn inside of the Nexus itself, look like. It turns out that you may be the perfect person to answer that question.
Ali:	02:35	Perfect question. Yes, I currently work on that platform in helping design different course work to tie in data science, and AI, and machine learning into the domain knowledge of that course. For example, I’ve worked with one professor (Dr. Phil Barry, George Mason University) on a project management class to create data science course work for him that ties in data science to project management. I have no experience with project management, but it was a really interesting challenge in learning how project management is taught and how it works, and then inserting data science into that lesson plan.
Cameron:	03:10	Yeah, so I think that would be the exact place we want to go. Can you tell us more about … first of all, project management. I feel like there’s lots of different meanings for that word. Do you want to elaborate more on what exactly the project management aspect of that work was?
Ali:	03:24	Sure. Project management as a discipline is very involved in understanding how to mitigate three things. One is they’re trying to understand how to address time when undertaking a new project, or a current project. They want to address money, and they want to address risk. Those three things serve as the core of understanding project management. You want to be able to understand where in a new project you’re going to be spending a lot of money, or you might need to keep a little bit of money saved to address those problems in the future.
Ali:	04:01	You want to understand how much time you’ll need, and if there will be any delays in different parts of the project as you plan. And you also want to be able to understand what other risks you have. Do you have people risks, in terms of people maybe leaving the project, or new people being needed? Are there problems with other components of the project? Let’s say that this is a project in data science. You need data, and sometimes the data comes to you in a format that you might not have expected it, so it requires extra man hours to clean. Maybe the data came unfinished, or it wasn’t fully there.
Ali:	04:35	Sometimes you need to be prepared for those things, and project management tries to mitigate that by creating a whole discipline around mitigating those three things, mitigating risk in those three different categories.
Ali:	04:47	When we first started, I actually only knew project management from a corporate perspective, which was, okay, we’re going to use this project management work flow in a new project, and this is how it’s built out. A lot of people will associate something like a Gantt chart with project management, and that’s one of their currencies in project management.
Cameron:	05:08	If you don’t mind me asking you real fast, can you explain a Gantt chart to anyone who might not be familiar with it?
Ali:	05:13	Sure. A Gantt chart is a chart used for delineating different tasks over time, and you can view those tasks concurrently such that you have multiple tasks laid out and where they overlap in the time frame.
Cameron:	05:30	It usually looks like, if you can imagine a bar graph where the bars are kind of …
Ali:	05:35	They’re scaled.
Cameron:	05:36	Yeah.
Ali:	05:36	They’re scaled for time.
Cameron:	05:38	The length of the bar represents the amount of time it’s expected to take, and all the bars are kind of stacked neatly next to each other, so you have a good idea of like, June, July, or whatever time frame. You can see when a bar ends is when that work is expected to end.
Ali:	05:50	Mm-hmm (affirmative). Exactly. What this allowed us to do is take data science and machine learning techniques, apply them to a whole new data set, and provide students in this project management class the ability to understand how they can use data to mitigate risk in the future.
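For readers who have not seen a Gantt chart, the description in the exchange above can be sketched as a small text chart: one bar per task, positioned by start date, with a length proportional to duration. The task names and dates below are hypothetical and do not come from the course material.

```python
from datetime import date

# Hypothetical tasks: (name, start date, end date). Invented for illustration.
TASKS = [
    ("Collect data", date(2020, 6, 1), date(2020, 6, 14)),
    ("Clean data", date(2020, 6, 8), date(2020, 6, 28)),
    ("Build model", date(2020, 6, 22), date(2020, 7, 12)),
]


def gantt_rows(tasks):
    """Return one text row per task; one '#' character represents one day."""
    origin = min(start for _, start, _ in tasks)
    width = max(len(name) for name, _, _ in tasks)
    rows = []
    for name, start, end in tasks:
        offset = (start - origin).days          # days after the earliest start
        length = max(1, (end - start).days)     # bar length scaled to duration
        rows.append(f"{name.ljust(width)} |{' ' * offset}{'#' * length}")
    return rows


def overlaps(task_a, task_b):
    """True when two tasks share at least one day on the calendar."""
    _, start_a, end_a = task_a
    _, start_b, end_b = task_b
    return start_a <= end_b and start_b <= end_a


print("\n".join(gantt_rows(TASKS)))
```

Reading the printed rows top to bottom shows where bars overlap in time, which is the property Ali and Cameron describe.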
Ali:	06:11	What we did is we (MITRE instructional designer Joe Garner and myself) found a data set that already existed. It had a list of project failures from a project management perspective. And it had a lot of different variables, or data points, on why the project failed, the length of the project, the country of the project, and other relevant factors.
Ali:	06:33	What we were able to do is supplement that data set with some synthetically generated data and provide that with a Jupyter Notebook lesson plan to the professor, and he was able to deploy that. We walked the students through that notebook and gave them a chance to understand how they can use the data, and analyze it further to make conclusions and mitigate risks in the future.
Cameron:	06:58	I’ve got to say, there’s a lot to unpack there, because I think you spoke to the full breadth of the Generation AI Nexus platform in that one description for this one project. Let me just start by saying: I love this idea for machine learning, because we at MITRE, we’re always dealing with kind of the cutting edge of technology, or new technologies that no one’s really been playing with or using. It can be impossible to clearly explain to a sponsor, “This blockchain solution, this machine learning solution,” whatever, “we expect the progress to look like this.”
Cameron:	07:28	You can set expected milestones, but as a technology changes or as you’re learning about new ways to adapt it, it can be very difficult to follow the actual expected progress. Because literally no one has ever worked on it before.
Ali:	07:43	Exactly. I completely agree with that. If you look at the current workplace, and projects that we’re seeing, there’s a big need for people outside of our current disciplines to branch out and have these skills so that our sponsors can use them.
Ali:	07:59	For example, although it’s great being a data scientist, I might not have the domain knowledge as a person who studies maybe history, or biology, or some other discipline where there’s no expected level of technological readiness in data science or programming. If you give them the skills to be able to use their domain knowledge in conjunction with data science skills, then you can provide a very valuable resource for our sponsors.
Cameron:	08:30	Yeah. I think that’s an excellent point to just how MITRE’s trying to bring these kinds of technologies to sponsors, not necessarily just as academia, but also as a tool for anyone who’s hoping to train their workforce in machine learning to see what those goals and benefits look like.
Cameron:	08:46	Going back to something else you said, you mentioned you had supplemented a data set you had already found with synthetic data. Can you speak both to the why and how you did that?
Ali:	08:55	Sure. In data science, one of the biggest key problems is not having enough data.
Cameron:	09:01	Mm-hmm (affirmative).
Ali:	09:04	This means that you need to be able to (a), either have enough data or (b), use different methods than the ones provided by data science. A lot of data science algorithms and methods require a very large amount of data to produce any actionable insights. Unfortunately, with the data set that we received, it only had about 800 to 900 rows of data and that wasn’t enough to provide enough of an analysis for the students.
Ali:	09:33	What I did is I looked at the data that we had, and I created more data based on the statistical properties of that original data set. I made sure that the statistical properties of the original data set, such as the mean values for some of the columns or the mode of some of the other columns, they stayed representative of the original data set when I created more rows. What this allowed us to do is vary the data set a little bit more so that the students had a little bit more data to play with, and then it also allowed us to use other methods to analyze that data.
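The transcript does not say exactly how the extra rows were generated, so the following is only a minimal sketch of one way to create rows that preserve simple column statistics. It assumes numeric columns can be drawn from a normal distribution with the observed mean and standard deviation, and that categorical columns can be sampled in proportion to their observed frequencies. The function `synthesize_rows`, the column names, and the sample rows are all hypothetical.

```python
import random
import statistics
from collections import Counter


def synthesize_rows(rows, numeric_cols, categorical_cols, n_new, seed=0):
    """Draw n_new synthetic rows that mimic simple per-column statistics.

    Assumption: columns are treated independently, which ignores any
    correlation between them. The source does not describe how (or whether)
    correlations were handled.
    """
    rng = random.Random(seed)
    numeric_stats = {
        col: (statistics.mean(r[col] for r in rows),
              statistics.stdev(r[col] for r in rows))
        for col in numeric_cols
    }
    category_counts = {
        col: Counter(r[col] for r in rows) for col in categorical_cols
    }

    new_rows = []
    for _ in range(n_new):
        row = {}
        for col, (mean, stdev) in numeric_stats.items():
            # Normal draw around the observed mean, clamped at zero because
            # a duration cannot be negative (this nudges the mean up slightly).
            row[col] = max(0.0, rng.gauss(mean, stdev))
        for col, counts in category_counts.items():
            # Sample categories in proportion to how often they were observed,
            # so the most common value (the mode) tends to stay the mode.
            values = list(counts.keys())
            weights = list(counts.values())
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        new_rows.append(row)
    return new_rows


# Hypothetical stand-in for the original data set.
original = [
    {"duration_months": 12, "country": "United States"},
    {"duration_months": 18, "country": "United States"},
    {"duration_months": 9, "country": "Canada"},
    {"duration_months": 24, "country": "United States"},
    {"duration_months": 15, "country": "Germany"},
]
synthetic = synthesize_rows(original, ["duration_months"], ["country"], n_new=1000)
```

Comparing `statistics.mean` and `statistics.mode` on the original and synthetic columns is a quick check that the generated rows stay representative in the sense Ali describes.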
Ali:	10:13	The interesting challenge we actually had with this data set was that there were a lot of data points that skewed it very much, because the data was collected mostly in the United States, and so we didn’t have as many projects as we wanted from other continents. To deal with that, we added an aspect of analyzing the data such that the students had to use their own analytical skills, and think through the problem and say, “Okay, I only need the data from the United States and not from these other regions.” Because the problem that they were trying to tackle occurred in the United States as well.
Ali:	10:51	What this allowed us to do is give them a very open-ended data set, and we had them play with it so that they could get some insights from it that would be more relevant to the problem they were trying to solve. Just to give some background on what the students were trying to solve, they had a project from the professor that was trying to create a project plan for an autonomous bus system.
Cameron:	11:14	Mm-hmm (affirmative).
Ali:	11:15	What this autonomous bus system would do, it had to have a certain budget, the students had to have these buses run between certain geographic areas, and they had to create a whole project plan to mitigate the amount of risks that’ll happen in the future.
Cameron:	11:30	Just to do a little explanation of what’s going on here, this is still a project management course, but we’re trying to introduce this hypothetical technology, which is this autonomous bus. In this scenario, you’re saying, given these other projects, these other forays into modern technologies or emerging technologies, what can we discern or eke out about this new project we’re trying to take on in terms of the costs, the impact, the time, and just the other elements that I believe you mentioned were in the data set?
Ali:	12:01	Exactly. What we want to do is we want to be able to use the data of past projects and why they failed, and use data from that, and apply what we learned from that, into this new data set and the new problem that the students were trying to tackle.
Cameron:	12:18	Now, I think a big thing about this lesson plan, I guess, that still blows my mind is you’re teaching students, really students of any level, about how to use machine learning, to learn this. But what does creating a lesson plan, where machine learning is the focus, and you really want them to not necessarily just know how to play with data, but how to apply machine learning best practices. What does that conversation or process look like?
Ali:	12:46	Sure, yeah. That’s a great question. I would say that the first thing to be mindful of is a lot of the students in this class had never touched Python before, and so we had to start the lesson with some basic Python skills, as well as Jupyter Notebook skills.
Cameron:	13:01	I should explain very quickly, Python, for anyone who isn’t aware, is a scripting language that is fairly robust. You can do a lot with it without having to know a lot about programming. A lot of machine learning technologies are built in Python, and the Generation AI Nexus leverages Python for its machine learning algorithms, correct?
Ali:	13:22	Yes. That was a great preface. Sometimes you forget that you may not have a fully technical audience so, thank you for explaining that.
Ali:	13:31	Another technology that we’re using is Jupyter Notebooks, which is an easier way to use Python, and develop and code in it, and it’s not as daunting just because of the layout and the way it works. One thing we had to be mindful of is these students don’t know how to code in Python, so we added that initial section to help them understand how Python works and what packages are, and what are functions, and what are these different commands that they were seeing. And then we also went into reading data into Python.
Ali:	14:05	How do you take an Excel file and read it in, and then when you read it in, how do you create variables out of the data that you have, and how can you clean it? How can you play with the data, how can you manipulate it, how do you get into it? This part was a little bit challenging, because when you don’t have any knowledge of Python, you need to have some initial grasp of computer science and understanding how variables work, what are data structures, what are lists, what are these different techniques, what are for loops.
Cameron:	14:36	That’s a few years’ worth of computer science you’re just jamming into this course.
Ali:	14:39	Yeah, I tried to stay away from some of the very technical stuff, and I tried to make it very easy to digest. Anywhere where there were some possible questions, we tried to supplement with additional resources in the notebook. From the feedback that we got, it seemed like those resources were pretty effective in answering a lot of the questions that the students had.
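The kind of introductory material Ali describes (reading data in, storing it in variables and lists, cleaning it with a for loop, and narrowing it to one country) might look roughly like the sketch below. This is not the actual notebook. It assumes a small inline CSV in place of the Excel file so the example is self-contained, and the column names and rows are invented.

```python
import csv
import io

# Hypothetical stand-in for the spreadsheet; column names are invented.
RAW = """project,country,duration_months,failure_reason
A,United States,12,budget
B,Germany,18,staffing
C,United States,,data quality
D,United States,24,budget
"""


def load_projects(text):
    """Read CSV text into a list of dictionaries, one dictionary per row."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Drop rows with a missing duration and convert the rest to integers."""
    cleaned = []
    for row in rows:  # a for loop visits each row in turn
        if row["duration_months"] == "":
            continue  # skip incomplete records
        cleaned.append({**row, "duration_months": int(row["duration_months"])})
    return cleaned


def only_country(rows, country):
    """Keep only the rows recorded for one country."""
    return [row for row in rows if row["country"] == country]


projects = clean(load_projects(RAW))
us_projects = only_country(projects, "United States")
print(len(projects), "clean rows,", len(us_projects), "from the United States")
```

The last step mirrors the decision the students had to make for themselves: keep only the United States projects because the problem they were solving was set there.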
Cameron:	15:01	Something I’ve seen getting to interact with and participate in a small part with the Generation AI Nexus project is the Jupyter Notebooks are very robust, in my opinion. Because I was involved in a hackathon with some high schoolers, where they were, I would say, a little more tech-savvy than perhaps the people who are taking this project management class. But their benefit was … there are these different toggles within the Jupyter Notebook. If your goal is to just show people data and give them some sliders to play with, and see how does this impact the data or whatever, that’s one thing.
Cameron:	15:36	But then you can hit a toggle and it’ll bring up a little code block, basically a text field you can type in custom Python code, or some specific query or whatever. Basically just a way for you to get a little more nitty gritty with how the notebook treats the data so that if I am a complete novice, and I don’t want to be bothered with programming, I’m just trying to figure out this data stuff, I can stick with that. Or if I’m in a more advanced course where we really want to be building custom queries, or make sure that the data we’re playing with is correct or valuable, then I have that option as well.
Cameron:	16:12	I realize that’s a lot of just extra stuff, but I think this actually speaks to something about the data itself, which is at the end of the day, as much as you want to be able to play with your data, as you pointed out before, it can be hard to get enough data. You would think in a world this connected, it wouldn’t be hard to get data. But it’s very difficult to get enough data.
Cameron:	16:33	As you were pointing out, you do a lot to generate synthetic data to kind of fill in that gap. When you’re generating that synthetic data, do you have to take any special steps to make sure that … I believe that the term is overfitting. As I understand it, there’s two kinds of overfitting. Overfitting where the machine learning algorithm itself, the way you’ve customized it or set it up, maybe through these Jupyter Notebook toggles, it basically stops thinking about the general aspect of “if I saw this kind of data in the wild, what would I think?” And focuses specifically on “this data is the only data, and therefore as long as I’m seeing data that looks like this, I will know what to do.”
Cameron:	17:12	But it makes it very difficult for it to accommodate very different-looking data, where suddenly your budget’s wild, but your time is also wild.
Ali:	17:20	Yes, of course.
Cameron:	17:22	Okay. Something I wanted to know was were there considerations around that, in terms of how you generated the data itself?
Ali:	17:27	Sure. I completely agree with that point. There are two ways to overfit, like you said. One is the model is overfitting to the data that is presented, or the data itself is biased in the sense of, especially with synthetic data, where maybe this data isn’t representative of the real world because you’re creating it from scratch and it’s not real.
Ali:	17:47	And there were some considerations made for that. I wanted to be as true to the original data set as possible to minimize that, because we didn’t collect this data. This data came from a third-party source and I wanted to stay true to the statistical underlying parameters that that data set already had. It had around 800 to 900 rows and so, I didn’t feel that it was too small of a data set where if I kept true to the statistical properties that it already had, we would have overfitting.
Ali:	18:20	If, for example, we had a data set with maybe like, 50 to 60 rows, and I had to create synthetic data based on the statistical properties of that data set, then I would say that overfitting would be a much larger problem there, because the sample size of the original data set is so small. But when you have 800 to 900 rows, you have a little bit more that you can play around with, and you have a little bit more statistical variability on that data set to be able to create data that’s a little bit more representative.
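Ali's point about sample size can be illustrated with a small simulation: statistics estimated from about 55 rows vary far more from sample to sample than statistics estimated from about 850 rows, so synthetic data built on the smaller sample is more likely to lock in noise. This is a sketch under stated assumptions, not his analysis. The population (project durations with a mean of 18 months and a standard deviation of 6 months) and the function `spread_of_sample_means` are made up for illustration.

```python
import random
import statistics


def spread_of_sample_means(sample_size, trials=200, seed=0):
    """Standard deviation of the sample mean across repeated draws.

    Assumption: a made-up population of project durations, normally
    distributed with mean 18 months and standard deviation 6 months.
    """
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.gauss(18, 6) for _ in range(sample_size))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


small = spread_of_sample_means(55)    # roughly the "50 to 60 rows" case
large = spread_of_sample_means(850)   # roughly the "800 to 900 rows" case
print(f"55 rows: {small:.2f} months, 850 rows: {large:.2f} months")
```

In theory the spread shrinks with the square root of the sample size, so the estimated mean from the larger sample should be about four times more stable here.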
Cameron:	18:53	I think a final note to go out on, the benefit of this technology is you’re introducing machine learning to the masses. It’s making it easier for our sponsors to get their workforce acquainted with machine learning. It’s making it easier for academia to train people, or find ways to show people the power of machine learning.
Cameron:	19:10	But I think you pointed out something very valuable there, which is a lot of people, they hear machine learning and they can think of it as this be-all end-all, this is the future and everything is going to be perfect, because machine learning’s great. But what we’ve found in all sorts of different research studies is machine learning has a problem of bias, and being basically misunderstood at times, which can be extremely harmful to different populations.
Cameron:	19:34	As someone who loves exploring the question of bias in machine learning in some of my own work that I’ve gotten to do, I was curious if you could speak to the differing perspectives that this platform brings to the users, and how that helps them mitigate maybe not just the bias question, but also other questions around how to properly engage with machine learning technologies?
Ali:	19:56	Great question. I think one thing that people tend to forget is that data science is not supposed to be black box in the sense that you don’t want people to just think of data science as I throw my data into this model, I get a specific result out, and that’s the end of it. I got my results, I can now apply this model to new data, and I can take whatever the model pops out and that’s the end of it.
Ali:	20:21	I don’t think that’s the approach that we want to take. It should be more of understanding that data science has these advantages, it has these pros, but also has these cons. You have to really evaluate what methods you’re using for what data. And you need to understand that sometimes a certain model will pop out a certain result, and you have to take it with the consideration in mind that that model may have some bias, it may have some shortcomings, and it may not be addressing all the problems or potential other data points that might come in the future.
Ali:	20:56	One question that still needs to be answered is, how much detail can you go into with these lesson plans where students can understand the power of data science and the power of these different machine learning algorithms. But then also take away that they can’t just apply these algorithms and models to everything. There should be careful consideration on what you’re using these models for: AI problems with AI and ethics questions, with problems in privacy, problems with autonomous vehicles and other risks like that with new technologies.
Cameron:	21:32	I think that’s always going to be a difficult question, because there’s a lot of research being done into … there’s literally projects that are like, how do you get machine learning to show their work, show how they came to a conclusion. Because bias itself is technically not a bad thing. I mean, bias usually speaks to a heuristic, a truth about the world. But the important thing is we should be able to account for certain kinds of bias, bias that we don’t want to have that might be based on inadequate or inaccurate data.
Cameron:	22:00	Thank you so much for your time. This has been an amazing conversation. Before we go, if there’s anyone who wants to get started in machine learning or getting up to speed with how these different tools work, if there are any resources you could point them to or anything internal or external to MITRE that people should be aware of?
Ali:	22:17	Sure. I would say one great place to start, especially if you’re interested in these different technologies, methods, and you want to get started in coding, I would say that the Internet is a great place to learn these things. I can’t even count how many. After scouring the Internet for resources, there are hundreds of thousands of courses online and they all do a great job. There are also tons of different textbooks. Unfortunately, I can’t recommend one specific one off the top of my mind. But there are multiple textbooks out there that really go into detail about data science, about Python, about other coding languages, such as R, that are used in data science, that are great resources.
Ali:	22:58	I would also say that hopefully, as the Generation AI Nexus spreads and grows in the coming years, hopefully that’ll also become a resource for students around the world.
Cameron:	23:09	All right well, we’d like to give a big thank you to MITRE and the Knowledge-Driven Enterprise for making this podcast possible, and we’d like to give an even bigger thank you to you, Ali. This has been my conversation with Ali Zaidi, a data scientist in MITRE’s Data Analytics department. Thank you.
Ali:	23:21	Thank you. Thank you so much for having me.