Artificial intelligence (AI) is all the rage, and your interests have probably brought you much chatter of forward and backward chaining, neural networks, deep learning, Bayesian logic, clustering, classifier systems, and the like. These are all AI techniques, but the unsung magic in AI comes from giving the same degree of respect to the data it requires, in both quantity and quality. In effect, AI needs big data. In my previous tutorial, “Big-brained data, Part 1,” I gave an in-depth introduction to the role of data in AI. In this tutorial, I explain how to apply the iterative version of the well-known software development lifecycle (SDLC) to data with AI applications in mind. Though many AI algorithms use data in different ways, machine learning in one form or another is driving most of the current AI boom. For this reason, and in keeping with the previous tutorial, most of the discussion and examples here emphasize machine learning.
The AI data lifecycle
The best way to think of effective gathering, preparation, and use of data for AI is as a parallel to the software development lifecycle. In the spirit of recent developments in agile development, I favor a well-defined but iterative approach to managing data for AI rather than a rigid “waterfall” approach.
Developers should already be familiar with the iterative SDLC. As a project is initiated, it enters the planning and requirements quadrant, after which iterations continue throughout the lifetime of the software, truly putting the “cycle” in the lifecycle. There are variants on the idea, including those that treat deployment as an exit from the process after testing.
Of course, developers usually think of these stages in terms of functional specifications, database schemas, code interface and structure diagrams, program code, and test cases. Preferably, there would be data flow diagrams as part of design, but too often it seems that data is treated only fragmentarily in the SDLC, which is especially a detriment to AI development.
Planning and requirements
The most important thing the team can do overall is to incorporate the need for data into planning alongside the general problem definition and understanding of the deployment environment. If the team is using one of the many AI techniques that require training examples, the team should decide what sort of training data is to be acquired. This depends on what data is available, as well as the problem statement and requirements, because the more the training data resembles the spectrum of results that are needed from the deployed software, the greater the chance of success.
There can be a sort of feedback loop where a lack of availability of suitable training data causes a modification to the requirements for the solution. Perhaps the original requirements can be restored in future iterations. Say, for example, that the team is developing an iris flower identifier program, but all that is available for training at first is the famous iris data set (see Part 1 for more information on this). The team might decide that for early iterations the goal should be to recognize an Iris setosa from the other two species with a high degree of confidence, but to accept less confidence in distinguishing Iris virginica and Iris versicolor. As better techniques or better data becomes available in later iterations, the expectations could grow toward confidently classifying all three.
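To make the staged goal concrete, here is a minimal sketch in Python of such an early-iteration classifier. It leans on the widely reported fact that Iris setosa petals are markedly shorter than those of the other two species; the 2.5 cm threshold and the function name are illustrative assumptions, not code from this tutorial.

```python
# Sketch: staged classification goals for the iris problem.
# Early iterations commit only to separating Iris setosa, which is
# separable from the other two species by petal length alone;
# distinguishing virginica from versicolor is deferred to later
# iterations. The threshold is illustrative, not tuned on real data.

def classify_iris(petal_length_cm: float) -> tuple[str, str]:
    """Return a (label, confidence) pair for a single measurement."""
    if petal_length_cm < 2.5:  # setosa petals are much shorter
        return ("Iris setosa", "high")
    # A later iteration would distinguish the remaining species.
    return ("Iris virginica or Iris versicolor", "low")

print(classify_iris(1.4))  # a typical setosa petal length
print(classify_iris(5.1))  # a typical virginica petal length
```

As better data or techniques arrive, the second branch would be replaced with a real model, without changing the calling code.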
Analysis and design
You should be putting together the raw data sources for the AI while you are analyzing and designing the code. As you start gathering this data, you will begin to understand how it needs to be vetted, augmented, maintained, and evaluated. Determine format limits and parameters early, such as the minimum and maximum size of images or length of audio. In the case of something such as the iris data set, do not forget to control for units. You do not want measurements made in inches to get mixed up with measurements made in centimeters. Remember the words of your old math teacher: everything must be annotated with units, whether in the data value itself, as an abstract data type, or in the data schema. In any case, you might want to include some sort of unit validation or conversion in the code or in the instructions for expert review.
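As one way to build that unit discipline into code, here is a hedged sketch of ingestion-time unit validation and conversion. The field names, the set of supported units, and the record structure are assumptions for illustration only.

```python
# Sketch of unit validation at ingestion time: every measurement
# carries an explicit unit, and everything is normalized to
# centimeters before it enters the corpus. Unknown units are
# rejected rather than silently passed through.

SUPPORTED_UNITS = {"cm": 1.0, "in": 2.54}  # conversion factors to cm

def to_centimeters(value: float, unit: str) -> float:
    """Convert a measurement to centimeters, or fail loudly."""
    if unit not in SUPPORTED_UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * SUPPORTED_UNITS[unit]

record = {"sepal_length": 2.2, "unit": "in"}  # illustrative record
record["sepal_length"] = to_centimeters(record["sepal_length"], record["unit"])
record["unit"] = "cm"
print(record)
```

Failing loudly on an unknown unit is the design point: a silent pass-through is exactly how inches end up mixed with centimeters.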
One important but often overlooked thing to include in the data design is provenance. Where did each data item come from, and how well can we trace it? What was its chain of custody through people or systems before it got to your application? How exactly did your application change it, either through algorithms or expert review? If you find anomalies, sources of bias, or other problems, provenance can be the key to understanding what data within a larger aggregate corpus you need to fix or discard. It is also important for continuous systems improvement. As your application matures and you get a better sense of what techniques and processes are effective and which ones are deficient, you can include that same understanding of data sources.
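One lightweight way to record provenance is to carry a source and a transformation history with every data item. The following sketch is illustrative; the class and field names are my own, not from any particular library.

```python
# Sketch of attaching provenance to each data item: where it came
# from, and every transformation applied to it along the way, so
# anomalies can later be traced back to their origin.
from dataclasses import dataclass, field

@dataclass
class DataItem:
    value: dict
    source: str                                   # original origin
    history: list = field(default_factory=list)   # chain of custody

    def transform(self, description: str, new_value: dict) -> None:
        """Record a change so it can be traced or audited later."""
        self.history.append(description)
        self.value = new_value

item = DataItem({"sepal_length": 5.1}, source="field-survey-2019")
item.transform("converted inches to cm", {"sepal_length": 12.95})
print(item.history)
```

With this in place, discarding all items from a suspect source, or undoing a faulty transformation, becomes a query over metadata rather than detective work.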
This stage is also when you must decide or reconsider the dimensionality of the data, whether for training samples or for data to be worked through the algorithm. (Part 1 contains a fuller explanation of dimensionality.) Does adding more detail to the data kill performance? Does it improve outcomes, or does it detract from reliability, perhaps due to the curse of dimensionality? Analysis and design are commonly where you would establish techniques for dimensionality reduction.
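As a concrete, if simplistic, instance of dimensionality reduction, the following sketch drops features whose variance across the sample is negligible. Real projects would more likely reach for PCA or similar; the threshold and the data values here are made up for illustration.

```python
# Sketch of a simple dimensionality-reduction step: drop features
# whose variance across the population falls below a threshold,
# so near-constant dimensions don't inflate the feature space.
from statistics import pvariance

def drop_low_variance(rows, threshold=0.01):
    """rows: list of equal-length feature vectors.
    Returns (reduced rows, indices of retained columns)."""
    cols = list(zip(*rows))
    keep = [i for i, col in enumerate(cols) if pvariance(col) > threshold]
    return [[row[i] for i in keep] for row in rows], keep

data = [[5.1, 3.5, 1.0],
        [4.9, 3.0, 1.0],
        [6.2, 2.9, 1.0]]  # third column is constant
reduced, kept = drop_low_variance(data)
print(kept)  # the constant third column is dropped
```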
Data flow diagrams (DFDs) are an important but underutilized design artifact. They were first described in detail as a key part of the software engineering process in the late 1970s in the highly influential volume Structured Design, by Ed Yourdon and Larry Constantine, whose work built on earlier work by David Martin and Gerald Estrin. Developers in areas such as online security have learned the importance of diagramming how the data of, say, an online banking application passes from the remote user’s browser through layers of increasing security to the retail ledger and back-office reconciliation systems. A similar level of detailed description is an important part of getting the AI data preparation process nailed down, which is in turn an important factor in field success.
DFDs for AI data tend to be based on common data acquisition and preparation techniques. The following figure is an abstract version of a DFD that you should update with the more specific steps you are taking, based on the nature of your problem space, the nature of the data, and the algorithms that you are employing.
A data flow diagram is about data, not process. The rounded boxes are processes, but they should be described in terms of what they do to the data. The arrows are key: they show the steps the data undergoes as it moves through the processes. The final repository for the data is the corpus for your application. Though DFDs are not about process flow, I included the cycle icon as a reminder that the process should be ongoing, with new data being acquired and ushered into the corpus as dictated by analysis, such as the results of population scoring.
To help you specialize a DFD for your own projects, here is a description of some of the abstract processes that are included in this example.
- Data acquisition is the process of getting raw data to be eventually incorporated into the corpus. It could be through digitization or data scraping—brute force extraction of data from some source on the web—or elsewhere.
- Data wrangling is the process of converting the data format to what is required for input, as well as to detect and annotate the data with its mechanically understood metadata characteristics, including provenance. Here is where you also attempt to identify discrete data items and eliminate duplicates.
- Data cleansing is the process of assessing the identified and tagged data items to correct or remove those that are corrupted, incomplete, or inaccurate.
- Scoring and incorporation is the application of statistical analysis to ensure the overall health of the resulting corpus. Each item might be scored according to its applicability to the needs of corpus maintenance. The entire population of items to be added to the corpus can be scored to make sure that its composition maximizes the efficiency and accuracy of the algorithms.
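The four abstract processes above can be sketched as a toy pipeline. Everything here, from field names to validation rules to the constant score, is illustrative; a real pipeline would be far richer at every stage.

```python
# Toy sketch of the four abstract DFD stages as a pipeline.

def acquire():
    # Data acquisition: raw records from some source (hardcoded here).
    return [
        {"petal_length": "1.4cm"},
        {"petal_length": "1.4cm"},   # duplicate
        {"petal_length": "oops"},    # corrupted
        {"petal_length": "4.7cm"},
    ]

def wrangle(raw):
    # Data wrangling: normalize format, tag provenance, drop duplicates.
    seen, out = set(), []
    for rec in raw:
        key = rec["petal_length"]
        if key in seen:
            continue
        seen.add(key)
        out.append({"petal_length": key.removesuffix("cm"),
                    "source": "demo"})
    return out

def cleanse(items):
    # Data cleansing: remove items that fail validation.
    def valid(rec):
        try:
            float(rec["petal_length"])
            return True
        except ValueError:
            return False
    return [rec for rec in items if valid(rec)]

def score_and_incorporate(items, corpus):
    # Scoring and incorporation: here every clean item scores 1.0 and
    # joins the corpus; a real system would apply statistical analysis.
    for rec in items:
        rec["score"] = 1.0
        corpus.append(rec)
    return corpus

corpus = score_and_incorporate(cleanse(wrangle(acquire())), [])
print(len(corpus))  # 2 items survive deduplication and cleansing
```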
A popular model for the scoring step is exploratory data analysis (EDA), which involves many combinations of visualizations of relationships between the different variables across the population. Doing a thorough job in EDA is an important factor in a successful reduction of dimensionality.
I move on next to the later stages in the SDLC, but it is very important to absorb this discipline: data analysis and preparation is an ongoing activity throughout all stages of the SDLC for AI.
Implementation and testing
One of the key lessons of SDLCs from the early days, through the many methodologies that have emerged for software development, has been to put actual implementation in its place. Most people come to programming because they find it an exciting pursuit. There is a thrill in instructing the computer step-by-step and watching it do something special. There is even a sort of thrill in hunting bugs and finding improvements to algorithms.
This psychological factor has always meant that developers end up with a tendency to want to dive into the implementation stage as soon as possible, and not put enough emphasis into planning, analysis, design, structured testing, systems management, and other aspects of successful projects. These other stages can seem boring in comparison to the actual coding, but experience and engineering discipline show, for example, that the common causes of project problems and cost and time overruns include failing to clearly coordinate requirements and insufficient attention to structured testing.
In the SDLC diagram, you can see that implementation is really just the tail end of analysis and design, and this emphasis carries over to data. The attention to the data during the implementation stage is largely a matter of ensuring that the design is faithfully accommodated in the coding of algorithms.
This all changes during testing. In an AI application, testing is where all the work done on data gathering and preparation is shown to have succeeded or failed. Training data is put through the algorithms, and the output is compared against expected results. The expected results might have been derived in the original process of acquiring the training corpus, or perhaps from previous iterations of the SDLC. The arc through testing and evaluation is especially emphasized when developing AI applications, and this critical juncture is also the reason why AI generally requires more iterations of the SDLC than other applications before it is ready for use in production.
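That structured comparison can be sketched in a few lines: run held-out samples through the model and tally matches against the expected labels. The trivial threshold model and the tiny test set below are illustrative stand-ins for a trained algorithm and a real evaluation corpus.

```python
# Sketch of the structured comparison at the heart of testing:
# compare the model's actual outputs against expected labels
# and report a simple accuracy figure.

def model(petal_length_cm: float) -> str:
    # Stand-in for the trained algorithm under test.
    return "setosa" if petal_length_cm < 2.5 else "other"

# (sample, expected label) pairs held out for testing
test_set = [(1.4, "setosa"), (1.7, "setosa"),
            (4.5, "other"), (2.6, "other")]

correct = sum(1 for x, expected in test_set if model(x) == expected)
accuracy = correct / len(test_set)
print(f"accuracy: {accuracy:.0%}")  # 100% on this toy set
```

In practice, the interesting output is not the single accuracy number but the list of mismatches, which feeds the evaluation stage and the next iteration's data gathering.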
Evaluation, and the cycle continues
At the heart of testing is a structured comparison between expected and actual outputs, a mechanical assessment of whether the algorithm processes training samples as expected. This is just the start of the process of making sure that the AI produces reliably useful results. Evaluation is often supervised by experts working with unplanned data from potential real-world applications. If you are developing a mobile voice agent, you might have a set of recorded phrases from several known speakers saying something like “what is the population of Ghana?” and you can compare the voice response against the expected answer. Evaluation might then include having new speakers ask the same question, or a variation, “what is the population of Nigeria?” so the experts evaluating the results can get a better sense of how the agent might fare in more realistic conditions.
The evaluation process generally includes the deployment of the application and its use by beta testers and then the customers. Feedback from all of this data would inform subsequent planning phases, and thus fuel subsequent iterations. Perhaps reports from the field indicate that the agent has trouble distinguishing the question “what is the population of Ghana?” from “what is the population of Guyana?” with some speakers. This might become an important sample for training and testing in subsequent iterations.
The first tutorial explained how important data is to the creation of AI and cognitive applications, how this importance has been constant throughout the history of the discipline, and how it has been connected to the historical successes and failures of AI. In this tutorial, I explained how to handle AI data with the same discipline as you do the code, in a parallel process to the code. Apply the traditional, iterative SDLC to make the most of the benefits that come with the large supply of data available these days, while remaining vigilant against the dangers.