
5 things you need to know when creating a cognitive app

The field of artificial intelligence (AI) has certainly had its ups and downs, with perhaps more than its share of hype cycles and AI winters, but a number of important technical advances have finally come together to put AI applications within reach of developers. High Internet connectivity speeds, APIs to expose cognitive algorithms, and new methods like deep learning are creating new opportunities for the use of AI. Of course, with these opportunities come new challenges that can sometimes be as difficult as the original problems AI is supposed to solve. These problems aren’t just technical in nature but require thought about policy for how these algorithms can behave and for what they can and can’t reveal.

Client-side vs. server-side concerns

The first thing to know about AI is purely architectural. Suppose you’re developing a mobile app to provide recommendations for restaurants based on a user’s current location. There’s a natural division of labor in this app that pushes some functionality to the client mobile app and some to the server side.

On the client side, computing power is measured in handfuls of gigaflops, and using that much would quickly dissipate battery power. On the server side (depending on the cluster), computing power is measured in sustainable teraflops, and of course electric power consumption is not a factor. Similarly, storage capacity for a mobile device is measured in gigabytes whereas clusters can exceed hundreds of petabytes. So clearly, although the mobile device is perfect for identifying location and interacting with users, building an application that stands alone on the device isn’t ideal.

A better architecture for our restaurant app would be for the server side to maintain a worldwide list of restaurants, menus, and reviews along with user history, likes, and dislikes. Using content-based filtering (a recommendation system approach that matches the attributes of items against a profile of the user’s preferences), the server side would then use the user’s current location and historical eating preferences to identify a list of restaurants nearby.
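As a rough illustration of the server-side step described above, the following sketch filters restaurants by distance and then ranks the nearby ones by how closely their attributes match a user profile. It is a minimal example under stated assumptions, not a production recommender: the restaurant records, the three-element feature vectors, the 2 km radius, and the function names are all hypothetical.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(user_profile, user_location, restaurants, radius_km=2.0):
    """Keep restaurants within radius_km, then rank them by similarity to the profile."""
    lat, lon = user_location
    nearby = [r for r in restaurants
              if haversine_km(lat, lon, r["lat"], r["lon"]) <= radius_km]
    ranked = sorted(nearby, key=lambda r: cosine(user_profile, r["features"]), reverse=True)
    return [r["name"] for r in ranked]

# Hypothetical data: features are [italian, sushi, vegetarian] on a 0..1 scale.
RESTAURANTS = [
    {"name": "Trattoria", "lat": 40.7420, "lon": -73.9890, "features": [1.0, 0.0, 0.2]},
    {"name": "Sushi Bar", "lat": 40.7425, "lon": -73.9900, "features": [0.0, 1.0, 0.1]},
    {"name": "Far Pizza", "lat": 40.9000, "lon": -73.9000, "features": [1.0, 0.0, 0.0]},
]
USER_PROFILE = [0.9, 0.1, 0.3]      # learned from the user's history (assumed)
USER_LOCATION = (40.7415, -73.9895)  # reported by the mobile client

print(recommend(USER_PROFILE, USER_LOCATION, RESTAURANTS))
```

In this division of labor, the client only reports its location; the restaurant catalog, the user profile, and the ranking all stay on the server.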

This example is an obvious separation of concerns based on architectural capabilities (although the lines can sometimes be blurred given what you want to achieve and the trade-offs you’re willing to make). However, storing a user’s data and preferences in the cloud comes with its own challenges, as you’ll see.

Data and privacy

Storing a user’s data, even when the data doesn’t uniquely identify that user, has inherent challenges that you must manage, and these challenges go far beyond simply encrypting data and protecting it so that it can’t be exposed through cyberattacks.

In 2012, a Target customer purchased unscented lotion, mineral supplements, and cotton balls (see How Companies Learn Your Secrets). Target’s statistical model correctly inferred that this purchasing behavior was commonly associated with pregnant women. Target then began to send that customer coupons for baby products as a way to drive sales. The recommendations were properly identified, but the customer in question happened to be a teenaged girl, and the coupons ended up in the hands of the girl’s father, who didn’t know his daughter was pregnant.

Another example of inadvertent release of information came from Netflix. In 2006, Netflix released an anonymized data set of movie ratings collected from roughly 480,000 users. Each record consisted of a user ID (an integer value), movie, date of rating, and grade. In 2007, two researchers at the University of Texas showed that individual users could be identified from this data by comparing their ratings with reviews posted on the public Internet Movie Database. This finding led to a class-action lawsuit in 2009 over the breach; Netflix settled the lawsuit in 2010.

On the other side of the coin, maintaining data about user preferences or searches can be beneficial to businesses. The same information can also become evidence in criminal cases. In 2016, a man whose son died after being left in a hot car was found to have searched “how hot does it get inside a parked car” and, even more suspicious, “how hot does it need to be for a child to die inside a hot car.” In 2011, investigators found that someone had used a woman’s computer to search for “chloroform” shortly before her daughter was found dead with high levels of chloroform in her system. (See Day by day: Key moments from the Justin Ross Harris trial and How your Google searches can be used against you in court.)

A more recent example involves Amazon Alexa, the company’s voice-operated intelligent assistant. Because an Alexa-enabled device is always listening for its wake word, prosecutors believe that one such device may hold evidence of a murder and have requested its logs (Amazon has so far rejected the request). Another smart device in the home, a water meter, is believed to show an inordinate amount of water used around the time of the murder, which is claimed to be evidence of a cover-up.

So, although we have an expectation of privacy, it’s not safe to assume that it exists on the Internet. Your storage and retention of user data may also be driven by local, national, and international laws.

Speed, quality, and decay of data

Whether your application involves collaborative filtering or techniques such as deep learning, the fuel that drives these algorithms is data. Not all data is equal, however, and more importantly, data can be relevant one day and useless the next.

Consider the development of a fraud-detection application. The speed at which your application can arrive at the right decision is key. Identifying a fraudulent transaction after the fraud has occurred is obviously not much better than ignoring the data altogether. Speed in some applications is a key tenet of the architecture and drives many concerns for real-time analysis.

Historical data can be completely relevant, but it is sometimes less relevant than current data. Take, for example, a recommendation system that predicts what might be of interest to you. Historical data can be useful in this context, but it won’t help the system if your preferences evolve. Therefore, the speed at which a cognitive application can apply data for decision-making and its ability to filter data over time to accurately model a user are important.
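One common way to let older data count for less is exponential time decay, where each observation loses half its weight after a fixed interval. The sketch below assumes a hypothetical order history and a 30-day half-life; both the data and the half-life are illustrative choices, not values from any particular system.

```python
def decayed_preferences(events, now, half_life_days=30.0):
    """Sum per-category weights; each event counts 0.5 ** (age / half_life_days)."""
    scores = {}
    for category, day in events:
        age = now - day
        weight = 0.5 ** (age / half_life_days)
        scores[category] = scores.get(category, 0.0) + weight
    return scores

# Hypothetical history: (category, day number on which the user ordered it).
EVENTS = [("steak", 0), ("steak", 10), ("steak", 20), ("vegan", 170), ("vegan", 180)]

scores = decayed_preferences(EVENTS, now=180)
top = max(scores, key=scores.get)
print(top, scores)
```

By raw counts the user has ordered steak more often, but the decayed scores favor the recent vegan orders, which is the behavior you want when preferences evolve.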

In addition, data quality is of the utmost importance in building accurate models. Deep learning models for image classification are vulnerable to adversarial images: inputs that a malicious user has altered, often imperceptibly, so that the model misclassifies an image it would otherwise classify correctly. Deep learning researchers view these images as security threats because their misclassification can have dramatic results (consider deep learning in the context of self-driving cars). Experiments have shown that applying these adversarial image perturbations in deep learning self-driving applications has led to road signs being misclassified as irrelevant objects (see Universal adversarial perturbations).
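To make the idea of an adversarial perturbation concrete, here is a toy sketch on a linear classifier rather than a deep network. It nudges every input value by a small, fixed amount in the direction that lowers the classifier's score, which is the idea behind the fast gradient sign method; it is not the universal perturbation technique from the cited paper. The weights, bias, four-value "image", and class names are all hypothetical.

```python
def score(weights, bias, x):
    """Linear decision score: positive means 'stop sign' in this toy model."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def classify(weights, bias, x):
    return "stop sign" if score(weights, bias, x) > 0 else "other"

def perturb(weights, x, epsilon):
    """Shift each input value by epsilon against the sign of its weight.

    For a linear model the gradient of the score with respect to the input
    is the weight vector, so this lowers the score by epsilon * sum(|w|).
    """
    def sign(w):
        return (w > 0) - (w < 0)
    return [xi - epsilon * sign(w) for w, xi in zip(weights, x)]

# Hypothetical model and input.
WEIGHTS = [0.5, -0.4, 0.3, -0.2]
BIAS = -0.05
IMAGE = [0.6, 0.3, 0.5, 0.4]

adversarial = perturb(WEIGHTS, IMAGE, epsilon=0.2)
print(classify(WEIGHTS, BIAS, IMAGE), "->", classify(WEIGHTS, BIAS, adversarial))
```

No single input value moves by more than 0.2, yet the predicted class flips. In real image models the inputs have thousands of dimensions, so far smaller per-pixel changes can have the same effect.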

From a user’s point of view, trust in an algorithm in the context of prediction and recommendation is highest when the algorithm can explain why it provided the solution. In recommendation systems, for example, explaining why a product was recommended (say, because the user purchased or viewed related items) increases the user’s trust.
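A simple way to support this kind of explanation is to carry the reason along with each recommendation rather than reconstructing it afterward. The sketch below assumes a hypothetical table of related items and a hypothetical purchase history; a real system would derive the relationships from data.

```python
def recommend_with_reason(purchased, related_items):
    """Return (item, reason) pairs for items related to the user's purchases."""
    results = []
    seen = set(purchased)  # never recommend something the user already owns
    for source in purchased:
        for candidate in related_items.get(source, []):
            if candidate not in seen:
                seen.add(candidate)
                reason = f"Recommended because you purchased {source}"
                results.append((candidate, reason))
    return results

# Hypothetical item-to-item relationships.
RELATED = {
    "tent": ["sleeping bag", "camp stove"],
    "camp stove": ["fuel canister", "tent"],
}

recs = recommend_with_reason(["tent", "camp stove"], RELATED)
for item, reason in recs:
    print(f"{item}: {reason}")
```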

Libraries and frameworks

When you’re developing a cognitive application, you don’t want to have to reinvent the wheel. Fortunately, open source options are broad and deep in the cognitive space.

If your cognitive application deals with large amounts of data, big data frameworks such as Apache Hadoop and Apache Spark offer not only the framework for managing large data sets but also machine learning algorithms to make sense of your data. One of the most popular frameworks for machine learning at scale with Hadoop or Spark is the Apache Mahout project. This framework comes equipped with a variety of machine learning algorithms, including item-based collaborative filtering, naive Bayes, hidden Markov models, and several clustering and dimensionality reduction algorithms. A recent addition to Mahout is the Samsara platform, which provides a vector math experimentation environment that uses an R-like syntax for at-scale data processing.

The Spark environment, which supports fast, in-memory computing, also provides a machine learning library called MLlib. This library includes many machine learning algorithms as well as workflow utilities for building machine learning pipelines.

If you don’t require your cognitive application to process data at large scales, libraries and toolkits are available that apply machine learning algorithms to smaller data sets. One of the most popular is scikit-learn for Python. This tool set is built on NumPy and SciPy and implements several machine learning algorithms that you can easily apply to data sets within a Python environment.

In addition to generalized platforms for applying machine learning algorithms to data sets, specialized environments exist for specific machine learning applications. One example is Autoware, which is an open source platform for urban autonomous driving. This software includes acceleration, brake, and steering control, with automatic detection of driving lanes, signals, cars, pedestrians, and other objects. Autoware was developed for the Ubuntu (Linux) distribution and uses the Berkeley Software Distribution license.

In 2015, Google released its open source machine learning system, called TensorFlow, which is a deep learning framework. TensorFlow processes multidimensional arrays called tensors through computations represented by stateful data flow graphs. This framework can run on a single device as well as clusters of CPUs and GPUs. Other important deep learning frameworks include Torch, Theano, and Apache SINGA.

Finally, IBM Watson offers a range of generally applicable deep learning-based services for language, speech, vision, and data analysis, with too many specific APIs to do any real justice to here. For more information, check out Watson Developer Cloud, where you’ll find open source Watson SDKs for Node, Java, Python, iOS, and Unity. Watson services are available on IBM Cloud.

So, within open source, you can find implementations for most machine learning algorithms, whether you plan to process data on large scales or in the context of a single computing system. You can also find specialized packages that implement domain-specific applications.

Fusion, ensembles, and the future of data processing

As our means of collecting data grow, so does the push toward cognitive computing to make sense of this data. Big data frameworks can provide the means to store and process data at large scales, but it’s the algorithms that can reduce this distributed data to something meaningful by which we can make effective decisions.

One domain whose mass of data is eclipsing our ability to grasp it is the Internet of Things (IoT). The IoT will drive new architectures for data collection and distributed management with new forms of security that do not exist today.

Another area of interest in data collection, processing, and machine learning is the quantified self. The quantified self is a movement to incorporate data acquisition into a person’s life to enable self-knowledge and personal informatics. Data can be fused from many sources, including fitness trackers (which track data like heart rate and quality of sleep) and online sources like GitHub commits. Data from these sources could be analyzed for correlations between exercise, sleep, and work output. Decision tree induction is a well-suited machine learning approach for identifying the factors that influence variables like productivity and work output. When this data is aggregated from a large population of users, interesting characteristics are likely to be exposed. This fusion of data sources will be critical to gaining new insights into populations.
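The core step of decision tree induction is choosing the attribute that best separates the outcomes, commonly measured by information gain. The sketch below applies that single step to a handful of hypothetical daily records (the field names and values are invented for illustration); a full induction algorithm would repeat the step recursively on each resulting subset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature, target):
    """Reduction in entropy of `target` after splitting `rows` on `feature`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_split(rows, features, target):
    """Pick the feature with the highest information gain."""
    return max(features, key=lambda f: information_gain(rows, f, target))

# Hypothetical fused records: sleep and exercise from a fitness tracker,
# productivity derived from commit activity.
DAYS = [
    {"slept_well": True,  "exercised": True,  "productive": True},
    {"slept_well": True,  "exercised": False, "productive": True},
    {"slept_well": True,  "exercised": True,  "productive": True},
    {"slept_well": False, "exercised": True,  "productive": False},
    {"slept_well": False, "exercised": False, "productive": False},
    {"slept_well": False, "exercised": True,  "productive": True},
]

root = best_split(DAYS, ["slept_well", "exercised"], "productive")
print("Split first on:", root)
```

For this made-up data, sleep quality carries far more information about productivity than exercise does, so it becomes the root of the tree.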

Beyond fusing data sources, however, is the concept of ensembles. Ensemble methods apply multiple learning algorithms to obtain better-quality results than could be possible from a single learning algorithm. By using multiple data sets that provide different perspectives on a given domain, and then applying ensembles of machine learning algorithms, improved performance and results can emerge.

An example of the benefit of ensemble methods is the Netflix Prize. The Netflix Prize was eventually won by applying several different algorithms that were combined to exploit the strengths of each model. These algorithms included singular value decomposition, restricted Boltzmann machines, and gradient-boosted decision trees.
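A small numeric sketch shows why combining models can help. Here three hypothetical rating predictors each make different mistakes, and an unweighted average of their predictions has a lower root mean squared error (RMSE) than any one of them. Simple averaging is used only for illustration; the Netflix Prize entries relied on more sophisticated blending, and all of the ratings below are invented.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def blend(*model_predictions):
    """Average the predictions of several models item by item."""
    return [sum(values) / len(values) for values in zip(*model_predictions)]

# Hypothetical true ratings and three models' predictions for four movies.
ACTUAL  = [4, 2, 5, 3]
MODEL_A = [5, 2, 4, 3]
MODEL_B = [3, 3, 5, 3]
MODEL_C = [4, 1, 5, 4]

blended = blend(MODEL_A, MODEL_B, MODEL_C)
for name, preds in [("A", MODEL_A), ("B", MODEL_B), ("C", MODEL_C), ("blend", blended)]:
    print(name, round(rmse(preds, ACTUAL), 3))
```

The improvement depends on the models making errors that are at least partly uncorrelated; averaging three copies of the same model would change nothing.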

Going further

This article explored a variety of considerations for the development of cognitive applications, including architectural considerations from a client and server perspective and the use of frameworks and libraries to build your applications. Much of the article focused on data concerns, from privacy to characteristics like staleness. Finally, it explored some of the challenges ahead for new technologies such as the IoT, and showed how the use of multiple data sets and ensemble methods has proven beneficial in improving results.