In recent years, data-driven decision making has become critical to the success of corporations. Using technology for data-driven practices brings many benefits, including the optimization of production and manufacturing, reductions in customer attrition, reductions in data redundancy, increased profitability, and the creation of competitive advantage. As organizations embrace data-driven decision making, data science has grown in popularity. Data scientists need a wide range of skills, including mathematics and statistics, machine learning and artificial intelligence (AI), databases and cloud computing, and data visualization. However, it is difficult to recruit enough data scientists, particularly those with sufficient knowledge of domains such as banking, healthcare, human resources, manufacturing, and telecommunications, for the tasks to be performed and the decisions to be made. Increasingly, data science is also becoming a kind of literacy: an understanding of data science techniques is required for many job roles, including roles where employees do not have strong coding skills.
In parallel with new tools that make data scientists more efficient, software has emerged that automates tasks within the data science workflow. Examples include Google’s AutoML, H2O, DataRobot, and open source libraries like Auto-sklearn and TPOT. Many of these systems build on the scikit-learn Python machine learning library. They are examples of AI for AI, in that AI technology is being used to build an AI solution. IBM® has produced state-of-the-art AI for AI technology and incorporated it into its product portfolio in the form of AutoAI.
What is AutoAI?
AutoAI automates data preparation, model development, feature engineering, and hyperparameter optimization. Its AI lifecycle management is a great help when getting started and when exploring what questions to ask. It then supports the subsequent experimentation, model modification, deployment, and governance steps. AutoAI comes as standard with IBM Cloud Pak for Data to be used and scaled across hybrid multicloud environments. AutoAI is also available on IBM Cloud through IBM Watson™ Studio.
AutoAI is an exciting example of AI for AI. The AutoAI tool automatically analyzes your data and generates candidate model pipelines that are customized for your predictive modeling problem. These model pipelines are created over time as AutoAI algorithms learn more about your data set and discover data transformations, estimator algorithms, and parameter settings that work best for your problem. Results are displayed on a leaderboard, showing the automatically generated model pipelines ranked according to your problem optimization objective, encouraging you to experiment further.
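The leaderboard idea can be illustrated with a minimal scikit-learn sketch. This is not AutoAI's implementation. The synthetic data set, the two candidate pipelines, and the `build_leaderboard` helper are assumptions chosen for illustration, and ROC AUC stands in for the optimization objective.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for a real data set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical candidates; AutoAI selects its own estimators and transformations.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": make_pipeline(
        RandomForestClassifier(n_estimators=50, random_state=0)
    ),
}


def build_leaderboard(candidates, X, y, metric="roc_auc"):
    """Score each candidate with 5-fold cross-validation and sort best first."""
    rows = []
    for name, pipeline in candidates.items():
        scores = cross_val_score(pipeline, X, y, cv=5, scoring=metric)
        rows.append((name, scores.mean()))
    return sorted(rows, key=lambda row: row[1], reverse=True)


leaderboard = build_leaderboard(candidates, X, y)
for rank, (name, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: ROC AUC = {score:.3f}")
```

Changing the `metric` argument re-ranks the same candidates against a different objective, which is the kind of experimentation a leaderboard is meant to encourage.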
Better together: Ask better questions with AutoAI
Data science is frequently about asking better questions, for example, exploring the data to identify which attributes are good predictors of an outcome. That means building many different models, selecting different features, and applying different hyperparameter optimizations. Options in AutoAI make it possible to explore better questions either by speeding up the AI process or by offering points of human engagement.
The entire AutoAI process can be automatically completed in minutes (depending on data volumes and other considerations) without human intervention, creating a baseline solution and making it suitable for beginners. However, domain experts can easily interact with AutoAI to incorporate their knowledge into the automated pipeline to improve the model produced and to customize to their local requirements.
Experts can optionally step into the automated AutoAI process at several points to specify their own preferences and incorporate their domain knowledge. Examples include:
- Data preparation – splitting data into training and test subsets, filling in missing values
- Advanced data refinery – specifying a subset of data to save resources and time, joining multiple data sources
- Feature engineering – applying certain off-the-shelf feature transformations, creating new features from the interactions of multiple features
- Neural network search – adopting specific architecture from the latest scholarly publications
- AutoAI pipeline optimization – selecting certain off-the-shelf algorithms, or plugging in existing algorithms
- Hyperparameter optimization (HPO) – turning HPO on or off, or choosing to run HPO each time after an auto-feature-engineering step; defining the search space for certain hyperparameters
- One-click deployment – choosing the target deployment environment, on IBM Cloud or on other cloud infrastructures
- Explainability and debiasing – detecting and mitigating bias from data, algorithms, or training with the help of AI Fairness 360
- AI lifecycle management – monitoring post-deployment performance in real time and improving model performance with reinforcement learning in one click
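As one illustration of these interaction points, the hyperparameter optimization item can be sketched with scikit-learn. This is a minimal sketch, not AutoAI's interface. The `fit_model` helper, its `run_hpo` switch, and the `expert_space` values are hypothetical names and settings chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)


def fit_model(X, y, run_hpo=True, search_space=None):
    """Fit logistic regression, optionally searching an expert-defined space."""
    estimator = LogisticRegression(max_iter=1000)
    if not run_hpo:
        # HPO turned off: keep the library defaults.
        return estimator.fit(X, y)
    # HPO turned on: exhaustively try every value in the supplied space.
    search = GridSearchCV(estimator, search_space, cv=3, scoring="roc_auc")
    search.fit(X, y)
    return search.best_estimator_


# A domain expert narrows the search to values they consider plausible.
expert_space = {"C": [0.01, 0.1, 1.0, 10.0]}

baseline = fit_model(X, y, run_hpo=False)
tuned = fit_model(X, y, run_hpo=True, search_space=expert_space)
print("baseline C:", baseline.C, "| tuned C:", tuned.C)
```

Grid search is used here only because it is simple to read. Any other search strategy could take its place behind the same switch.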
Better together: Complete your AI pipeline faster
Some have claimed that AI built by AI outperforms humans. A recent qualitative study conducted by Dr. Dakuo Wang and his team involved data scientist participants. Some participants were asked to build models using IBM AutoAI. The other participants each did the same task using Python libraries in a Jupyter Notebook environment. The study showed that data scientists working together with AutoAI can build models that are significantly better (0.92 versus 0.90 in ROC AUC score), build them faster (4.4 mins versus 15 mins), and make fewer human errors (100% versus 46.7% of participants successfully finished the task in the time assigned). The study also revealed data scientists’ attitudes toward and perceived interactions with AutoAI systems. The interviewees believe that the relationship between data scientists and automated AI systems will be collaborative rather than competitive.
AutoAI has been designed to incorporate human feedback and to augment data science practice while speeding up the experimentation process. This makes it possible for individuals without strong coding skills to explore different options, identify better questions to ask, select the most suitable models, and move models into deployment.
The dashboard for AutoAI fosters human interaction rather than replacing it, enabling data scientists and domain experts to make informed choices and contribute to model creation. In the following image of the IBM AutoAI system, you can see how eight pipelines are constructed (top visualization) as well as a leaderboard (bottom list) that ranks models according to a selected metric (ROC AUC). Out of dozens of algorithms, AutoAI chose two, logistic regression and random forest, and generated four models for each algorithm. Among the four models all using the logistic regression algorithm, pipeline P2 includes a hyperparameter optimization step, differentiating it from P1. Pipeline P3 includes a feature engineering step, and P4 includes a second HPO step.
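The progression from P1 to P4 can be approximated with a small scikit-learn sketch. It assumes that each pipeline builds on the one before it, which the description above suggests but does not state outright. The interaction features, the grid of `C` values, and the synthetic data are illustrative assumptions rather than what AutoAI actually runs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
grid = {"model__C": [0.1, 1.0, 10.0]}  # hypothetical search space


def base_steps():
    """Scaling followed by logistic regression with default settings."""
    return [("scale", StandardScaler()),
            ("model", LogisticRegression(max_iter=2000))]


def engineered_steps():
    """Same as base_steps, plus pairwise interaction features."""
    interactions = PolynomialFeatures(
        degree=2, interaction_only=True, include_bias=False
    )
    return [("scale", StandardScaler()),
            ("interactions", interactions),
            ("model", LogisticRegression(max_iter=2000))]


# P1: baseline estimator with default hyperparameters.
p1 = Pipeline(base_steps())
# P2: P1 plus a first round of hyperparameter optimization.
p2 = GridSearchCV(Pipeline(base_steps()), grid, cv=3, scoring="roc_auc").fit(X, y)
best_c = p2.best_params_["model__C"]
# P3: adds feature engineering, reusing the C value found for P2.
p3 = Pipeline(engineered_steps()).set_params(model__C=best_c)
# P4: re-runs HPO now that the feature set has changed.
p4 = GridSearchCV(Pipeline(engineered_steps()), grid, cv=3, scoring="roc_auc")

stages = {"P1": p1, "P2": p2.best_estimator_, "P3": p3, "P4": p4}
scores = {
    name: cross_val_score(est, X, y, cv=3, scoring="roc_auc").mean()
    for name, est in stages.items()
}
for name, score in scores.items():
    print(f"{name}: ROC AUC = {score:.3f}")
```

Later stages are not guaranteed to score higher on every data set, which is one reason a leaderboard that compares all four is useful.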
IBM researchers refer to this new paradigm of people working together with AI systems as “Human-AI Collaboration” where human and AI systems work as partners on particular tasks, in which each party contributes complementary, indispensable, and accountable capabilities.
AutoAI offers a number of benefits, particularly for people working to better understand and make predictions about their particular business or specialty. The benefits include:
- Building models faster because AutoAI prepares data, identifies features, performs optimizations, and generates models much faster than humans doing the work by themselves.
- Overcoming the skills gap, making it possible for industry domain experts who are new to data science to incorporate data science methods into their daily work.
- Uncovering more use cases because exploring models is quicker, giving more time for data scientists to experiment.
- Identifying key predictors that make a difference by using the auto-feature engineering option, which makes it simpler to extract predictions from a data set.
- Ranking and exploring models by comparing candidate pipelines to determine the best model for the particular task.
- Deploying models easily through AutoAI-generated pipelines. The deployed models can then be accessed and predictions made through REST APIs.
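The last benefit, calling a deployed model over REST, might look like the following sketch. The URL, the token, the field names, and the `input_data` payload shape are all assumptions made for illustration. A real deployment documents its own endpoint, authentication, and request format. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical endpoint and token; a real deployment supplies its own values.
SCORING_URL = "https://example.com/deployments/my-model/predictions"
API_TOKEN = "replace-with-real-token"


def build_scoring_request(fields, rows, url=SCORING_URL, token=API_TOKEN):
    """Build (but do not send) a JSON POST request for a deployed model."""
    # Assumed payload shape: column names plus one list of values per row.
    payload = {"input_data": [{"fields": fields, "values": rows}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_scoring_request(["age", "balance"], [[42, 1200.5]])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` and parsing the response body as JSON would complete the call against a live endpoint.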
This technology is changing quickly, so stay tuned for further developments in the areas of transfer learning, business constraints, and more.
Dakuo Wang is a Research Scientist at IBM Research AI, Cambridge, Massachusetts. His research lies at the intersection of human-computer interaction (HCI) and artificial intelligence (AI). He currently leads a team of researchers, engineers, and designers who conduct research and design the user experience for IBM AutoAI, a solution to automate the end-to-end machine learning pipeline. Drawing on studies of how users work with various AI systems such as AutoAI, chatbots, and clinical decision support systems (CDSS), he proposes “Human-AI Collaboration” as a new framework to study and design AI systems that work together with humans. Before joining IBM Research, Dakuo Wang received his Ph.D. and M.S. in Information and Computer Science from the University of California Irvine, a Diplôme d’Ingénieur (M.S.) in Information System from École Centrale d’Électronique Paris, and a B.S. in Computer Science from Beijing University of Technology. He has worked as an engineer, designer, and researcher in France, China, and the United States.