

Several months ago, I joined a new team at IBM — the Emerging Technology and Advanced Innovation group. This group was developing a new capability that uses new technologies, most notably Linux containers and Raspberry Pi computers, to provide edge computing for processing sensor data. I had no experience with either containers or the Raspberry Pi devices. And, I was an eager guinea pig for the early iterations, which quickly taught me the basics of these technologies.

After I understood these basics, I started to speculate on the types of applications that we might build. The team had already built a proof-of-concept (PoC) called BlueHorizon. BlueHorizon collects weather data, radio signals, and network connectivity information and provides that data in an interactive map that displays the various participating devices and any aircraft in the vicinity (see Figure 1).

Figure 1. BlueHorizon map and user interface

I wanted to build another PoC.

One aspect of the IBM portfolio that the team had not yet leveraged was the Watson cognitive services, so I initiated my search of the potential space of sensor data collection to include a cognitive component. Since the premise of the project was converting sensor data into information (for example, aircraft ADS-B broadcast frequency signal into structured flight data), the use of a cognitive algorithm “at the edge” would be both a challenge and a contribution to the PoC portfolio.

This article describes the lessons I learned when I built a cognitive IoT solution. The first lesson teaches you to limit your expectations of any cognitive system and, correspondingly, to train for specific scenarios and use cases. The second lesson teaches you how important understanding context is in building cognitive systems. The third lesson teaches that what you don’t know you don’t know is at least as important as what you do know. In summary, cognitive solutions are experiments in which the user is both experimental subject and arbiter of ground truth, and therefore of success.

A hacker’s gotta hack

Part of this story is who I am and how I arrived at this point in my life. To make a long story short, I am a life-long hacker and healthcare dilettante who dabbles in a wide variety of technologies. One recurring theme across these hacks is the desire to make things easier and more pleasant for me and my family. Specifically, I like to use technology to replicate and automate various tasks. The older I get, the more my desire not to be burdened by small annoyances grows, and I now realize that I might need assistance in my later years (and even now).

I also have a penchant for video — from my initial exposure to real-time video processing by analyzing railroad tracks, to my current infatuation with wireless surveillance and home theater systems. Our family residence is located in rural Silicon Valley, near the historical mercury mining village of New Almaden, in the southwest corner of San Jose, California. A big part of our life is our dogs (and our children), and I have added numerous wireless surveillance cameras over the years in response to my wife’s inquiry of “What is happening out there?” whenever the dogs explode in a fit of barking. What started as a single camera in the dog house turned into 21 cameras that watch everything from the dogs to the kitchen. With a variety of simple hacks and the acquisition of an iOS application, I was now able to see all the cameras at any time (see Figure 2) and not have to get up off the couch!

Figure 2. Web interface to all my video cameras

However, even I was somewhat overwhelmed by all these cameras. Each camera sent a 5-second video as an attachment in an email with every motion detection event; I was receiving over 1,000 emails per day. Even just deleting the emails without inspecting the videos became almost untenable. My first foray into solving this problem was to attenuate the sensitivity and bounding boxes for the motion detection algorithm on each camera. With judicious experimentation, and after I uttered an immense number of expletives, I was able to bring the volume under control, but I still suffered from numerous emails. My second effort was to eliminate the individual emails. Instead, I aggregated the short clips with an email agent and published them to my Plex Home Theater (see Figure 3) as a daily video. I was then able to view each day’s montage on my TV, smartphone, or notebook computer (including up to 32X speed review), which was a massive improvement.

Figure 3. Plex user interface to daily movies

Still, I was not satisfied. This solution still required a human-in-the-loop (me) to look for interesting things in each day’s video. And each video could easily be over 30 minutes (360 events). And, let’s be honest, you can’t truly review the video at anything more than 4X speed, which required up to 10 minutes of my precious time.

As I was relaxing on the couch and licking my wounds, I saw an IBM Watson commercial. While I can’t recall the specifics, it did remind me of my prior successes in image processing, and I wondered if the Watson Visual Recognition service could process my videos automatically and notify me only when something unusual happened. Of course, I then recalled our many dogs (and our four children), and I thought that unusual things were always happening to us. However, my parents lived alone, in a big house, thousands of miles away, in Chicago, Illinois, with no children and no dogs.

Perhaps I could find a use-case that would apply to far away parents who were still living in their own home? A confluence of inspiration, perspiration, and aspiration coincided at that point. My Dad was an early adopter of the Amazon Echo (with Alexa) device and became enamored of its capabilities. I thought that a device that could observe their daily behavior could provide a means to both monitor and stay connected.

Age-At-Home – Hacking together emerging tech

I embarked on the definition of a new project that I called Age-at-Home. In my PoC project, I would use Linux Docker containers, Raspberry Pi devices, IBM Watson services, and IBM Cloud services to help the elderly be able to “age in place” in their own home. I did some additional research, identifying a research project by IBM with the city of Bolzano, Italy that used passive carbon-dioxide (CO2) sensors to track the presence and movement of elderly individuals in their apartments.

I discovered other studies that used the instrumentation of individuals (for example, having the people wear Bluetooth watches or pendants) and the instrumentation of the environment beyond CO2 detection, including sensors attached to cabinets, sinks, refrigerators, toilets, and other devices. My previous experience with the elderly and tablet computers, including the associated costs with instrumenting specific things, indicated to me that passive sensing was going to be cheaper and more reliable. (Don’t ask how many tablets were dropped in toilets).

My initial aspirations were grand, but the Bolzano experiment indicated that there was value in tracking presence and movement. So, replicating that sensing capability and adding a simple response method would be sufficient to demonstrate initial success. Thus, my minimum viable product (MVP) was defined: Send a notification when the elderly person did not wake up (that is, appear in the kitchen) within a particular time range (for example, +2 standard deviations from normal). See Figure 4.

Figure 4. My MVP architecture for the Age-At-Home app
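The notification rule at the heart of this MVP can be sketched in a few lines of Python. The history values and the 2-standard-deviation cutoff below are hypothetical placeholders, not the production logic:

```python
import statistics

def wakeup_deadline(history_minutes, k=2.0):
    """Latest 'normal' wake-up time (minutes after midnight), defined as
    the mean of past wake-ups plus k standard deviations."""
    return statistics.mean(history_minutes) + k * statistics.stdev(history_minutes)

def should_notify(now_minutes, first_seen_minutes, history_minutes):
    """Notify only when nobody has appeared yet and the deadline has passed."""
    return first_seen_minutes is None and now_minutes > wakeup_deadline(history_minutes)

# Hypothetical history: wake-ups observed between 7:00 and 7:30 AM.
history = [420, 435, 450, 430, 445, 440, 425]
print(should_notify(9 * 60 + 30, None, history))  # → True (no one seen by 9:30 AM)
```

The same check, run locally on the Raspberry Pi against a cloud-computed history, is all the "edge analytics" the MVP strictly requires.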

To build the minimum viable product, I needed a way to sense the presence of a person, preferably a specific person, but at minimum a way to discriminate between a human and an animal, such as a pet dog. My prior experience with motion-detecting video cameras provided an interesting opportunity to not only use the IBM Watson Visual Recognition service, but also to highlight two important aspects: latency of such classification and privacy of the images classified. In addition, recent developments in image classification research indicated both the means and the opportunity to classify images on a low-cost device.

The initial configuration was relatively simple and straightforward. After I investigated options for cameras that attach to Raspberry Pi computers, the Motion package for Linux appeared to be robust and supported both the Raspberry Pi and an inexpensive USB camera for the PlayStation 3. The installation of the software was simplified by forking a Linux container definition for the Motion package and deploying that container to the Raspberry Pi. The Motion package invokes a specified Linux executable when certain activities occur, such as when a motion event is detected and an image is captured, and provides the appropriate activity data. The data encoded in the JPEG image was akin to the frequency signal data that was provided by the software-defined radio and would benefit from analysis “at the edge.” However, I still needed to determine the types of analysis available on a Raspberry Pi.
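The hand-off from Motion to the classification pipeline works through Motion's event hooks. The excerpt below is an illustrative sketch, not my actual configuration; the script path is hypothetical, and option names vary between Motion versions:

```
# motion.conf excerpt (illustrative; option names vary between Motion versions)
framerate 5
threshold 1500              # pixel-change threshold that triggers a motion event
output_pictures best        # save the single best image per motion event
target_dir /var/lib/motion
# Hand each saved image to a script that calls the classification service;
# %f expands to the full path of the saved file. The script path is hypothetical.
on_picture_save /usr/local/bin/classify-image.sh %f
```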

Additional investigation indicated that a Raspberry Pi supported many Linux image analysis packages (for example, imagemagick, ffmpeg, and others). Examples of face detection (such as a “magic mirror”) were also running natively. Lastly, I unearthed some research on deploying deep learning neural networks (that is, SqueezeNet) to the Pi as well. In addition, the ARM processor design that is used in the Raspberry Pi was also being used by NVIDIA in their Tegra platform, notably the TX1; so, advanced GPU capabilities might also be available “at the edge” in the near future. See Figure 5 for the architecture diagram of all the components of the Age-At-Home MVP.

Figure 5. Architecture diagram of the Age-At-Home MVP

Although the edge analytics would not initially include the image classification, the calculation of the current activity statistics and the resulting conditional analysis could be performed locally, which could then provide the required notification by using electronic messaging. The calculation of the aggregated statistics over the entire event history across all detected entities would occur in the cloud: that is, 53 weeks per year, 7 days per week, 96 intervals (15 minutes each) per day, and a total of over 330 different detected entities, such as people, dogs, and things. The resulting model would be downloaded by the Raspberry Pi and used to compare the current activity with historical activity and activate the notifications as appropriate.
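One possible way to index an event into that (week, day, interval) cube is sketched below; the exact bucketing scheme used in the project may differ:

```python
from datetime import datetime

def activity_bucket(ts):
    """Map a timestamp to the (week-of-year, day-of-week, 15-minute interval)
    bucket used for the aggregated statistics."""
    week = ts.isocalendar()[1]                # ISO week of year, 1..53
    day = int(ts.strftime("%w"))              # 0 = Sunday .. 6 = Saturday
    interval = ts.hour * 4 + ts.minute // 15  # 15-minute slot, 0..95
    return week, day, interval

print(activity_bucket(datetime(2017, 7, 4, 1, 30)))  # → (27, 2, 6)
```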


Cognitive IoT apps must be learning systems

Now, you’re probably still waiting for those lessons, so here is lesson 1: Your cognitive IoT application is a learning system.

My first mistake was in presuming that the presence or absence of a person in the kitchen was a viable determinant for when to send a notification. Aspects of the sensory input other than the detection of a person, such as lights turning on and overwhelming the camera sensor, needed to be considered. While person detection might be an initial hypothesis, it should be treated as only that, a hypothesis, and your application should be prepared to act on alternative signals.

The very early days of radio are a great example: multiple engineers were required to adjust the frequencies, antennas, and other electrical components to get a clear radio signal. Engineers acted as component parts, serving as the receivers and transmitters who interpreted the signal into data. There were recorders or transcribers who communicated the data as a coherent message. And, of course, there was the infrastructure to power the whole process. Only this specialized team could leverage the nascent radio technology. But that was then. Today, we have automatic tuning for our car radios, mobile devices, and many other devices. We must anticipate the same progression in the signal processing that is provided by our learning systems.

This type of automated signal detection and discrimination is referred to as “active learning” in academic papers. These processes use an “oracle” — human or otherwise — to provide feedback on the signal quality that was produced by the supervised learning algorithm. For example, in my Age-At-Home context, the predicted classifications for any image are typically a set, ranked from highest to lowest. These predictions are often referred to as TOPn, for example TOP5. The highest ranked entity, typically referred to as TOP1, is considered to be the “correct” classification. However, the top two or three might be very close in rank value (or “score”), and these images are potential candidates for review by an oracle. The oracle then provides adjudication, and the example becomes part of that class for the next iteration of training.
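A minimal sketch of this TOP1 selection and oracle-candidate test, assuming classifications arrive as a list of class/score pairs (the 0.05 margin is an arbitrary illustration):

```python
def top1(classifications):
    """Return the highest-scoring class from a ranked TOPn result."""
    return max(classifications, key=lambda c: c["score"])

def needs_review(classifications, margin=0.05):
    """Flag an image for the oracle when the top two scores are nearly tied."""
    ranked = sorted(classifications, key=lambda c: c["score"], reverse=True)
    return len(ranked) >= 2 and (ranked[0]["score"] - ranked[1]["score"]) < margin

scores = [{"class": "person", "score": 0.62}, {"class": "dog", "score": 0.60}]
print(top1(scores)["class"], needs_review(scores))  # → person True
```

Images flagged this way are the candidates the oracle adjudicates before the next training iteration.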

Using this approach, I was able to deploy a Linux Docker container to a Raspberry Pi that was running an open source package (Motion) to capture images. With the addition of a simple Linux shell script, I was able to send the images to the Watson Visual Recognition service and receive classification scores. The image classifications and scores that Watson returned are defined with respect to the classifier used. The default classifier indicated a wide variety of entities and activities, from “Stove” to “Riot”. I extended the script to store the JSON results from the Watson Visual Recognition service into a NoSQL repository, IBM Cloudant, which natively understands JSON data (see Figure 6).

Figure 6. Selected detail of example JSON produced
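Extracting the classes and scores from a response of this shape might look like the following; the sample payload is illustrative, not actual service output:

```python
import json

# Illustrative response in the shape returned by the Watson Visual
# Recognition v3 API (classes nested under images -> classifiers).
raw = """{
  "images": [{
    "classifiers": [{
      "classifier_id": "default",
      "classes": [
        {"class": "stove", "score": 0.80},
        {"class": "riot",  "score": 0.55}
      ]
    }]
  }]
}"""

def extract_scores(payload):
    """Flatten the nested response into (classifier_id, class, score) tuples."""
    out = []
    for image in payload.get("images", []):
        for clf in image.get("classifiers", []):
            for cls in clf.get("classes", []):
                out.append((clf["classifier_id"], cls["class"], cls["score"]))
    return out

print(extract_scores(json.loads(raw)))
```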

I was then able to begin to aggregate and analyze the results of the image recognition. After some extensive shell-script programming, I produced both CSV (comma-separated values) and JSON statistics for the various classifications, such as a count of stove images by interval of day and day of week (see Figure 7). The resulting multi-dimensional cube included count, max, mean, and standard deviation for both classifiers and scores. At this point, it was obvious that more powerful analysis tools were required.

Figure 7. REST API query for statistics for Stove on day 1, interval 10 (Sunday at 1:30 AM)
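The shell scripts that produced these statistics are not shown here, but the aggregation itself is straightforward; a Python sketch of the count/max/mean/standard-deviation cube, with hypothetical event tuples:

```python
import statistics
from collections import defaultdict

def aggregate(events):
    """Build count/max/mean/stdev per (class, day, interval) from
    (class, day, interval, score) event tuples."""
    buckets = defaultdict(list)
    for cls, day, interval, score in events:
        buckets[(cls, day, interval)].append(score)
    return {
        key: {
            "count": len(scores),
            "max": max(scores),
            "mean": statistics.mean(scores),
            "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        }
        for key, scores in buckets.items()
    }

events = [("Stove", 1, 10, 0.9), ("Stove", 1, 10, 0.7), ("Person", 1, 10, 0.8)]
cube = aggregate(events)
print(cube[("Stove", 1, 10)]["count"])  # → 2
```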

Automating the learning with Watson Analytics

My next stop was the IBM catalog of analytical services, starting with one of the newest, Watson Analytics. Watson Analytics provides an easy-to-use method to access, refine, and explore data visually. After I learned about the minor complexities involved in flattening JSON into SQL tables, and after I spent a little time refreshing my SQL skills, I could make the Age-At-Home image classification results available as a sparse pivot table that was suitable for consumption in Watson Analytics.

Watson Analytics gave me quick and easy visualization of the entities and activity time-series, but there was no easy way to share results or queries. Without the ability to share the results or to parametrize the input, it would be challenging to include Watson Analytics in any automated manner.

Watson Analytics provides capabilities to explore and visualize your data by using a catalog of pre-defined visual templates, such as a heat-map, and by using methods to associate the automatically discovered facts, dimensions, and measures. After a very quick exploration and a little trial and error, I had two simple overviews of the insights from its analysis.

Figure 8 is a heat-map of the classifiers that Watson Analytics returned from the default recognition model, with the size of each box indicating how many times that thing was recognized and the color of the box indicating the average confidence of that recognition. The Watson Analytics user experience provides an interactive method to exclude selected classes, which allows the relevant classes to be identified, such as, in my example, the “Adult” class. That subset, such as all events in which an “Adult” was recognized, can then be analyzed further, such as by time of day or day of week.

Figure 8. Watson Analytics heat map of Watson Visual Recognition results

By inspecting the classification results in Watson Analytics, I did uncover interesting potential signals, including identifying a “Riot” in my kitchen. However, the signal for which I was searching, the presence of a human, was not easily detectable amidst all the noise.

Using a subset of the signal spectrum from the default classifier was quite a challenge. A person was often not included in the result set; that is, the person was not detected. These types of failures are known as false negatives. The opposite of a false negative is a false positive, wherein the entity is detected, but was not present. Both these types of failures affect the user experience of receiving (or not receiving) notifications about abnormal activity. Again, additional levels of analysis capability were required. Most notably, I needed to identify false-negatives both across the population and in specific examples for subsequent inspection.
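Counting these failure modes requires ground truth for each event. A sketch, with hypothetical samples, of tallying false positives and false negatives for binary person detection:

```python
def detection_errors(samples):
    """Count false positives and false negatives for binary 'person present'
    detection. Each sample is an (actually_present, detected) pair of booleans."""
    fp = sum(1 for actual, detected in samples if detected and not actual)
    fn = sum(1 for actual, detected in samples if actual and not detected)
    return fp, fn

# Hypothetical ground truth vs. classifier output for six events.
samples = [(True, True), (True, False), (True, False),
           (False, False), (False, True), (True, True)]
print(detection_errors(samples))  # → (1, 2)
```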

The second visualization, in Figure 9, is a time-series that shows the population over the intervals of the day, in order of score. To create this visualization, the data had to be refined into an aggregation of average classifier score by interval, day of week (Sunday, … ,Saturday), and week of year (1-53), for future adjustments for seasonality. These results indicated a strong potential for success in deriving a pattern for normative behavior by using traditional methods to cleanse, aggregate, and analyze the data. By discarding the apparently extraneous classifications and focusing only on those classifications that represent people, I appeared to have an easy route forward. A quick analysis of the distribution for that representative set of classifiers appeared to provide a normative model. In Figure 9, the “people” classes are the five rightmost columns.

Figure 9. Watson Analytics time-series analysis

Enhancing the learning with Db2 Warehouse on Cloud and Looker

Thankfully, the IBM NoSQL JSON repository (Cloudant) provides automatic replication into the IBM SQL database service (Db2 Warehouse on Cloud), which enables the use of off-the-shelf analysis packages for data discovery and reporting that use industry-standard SQL. Also, Looker enables query and analysis of complex data for presentation and sharing. (IBM and Looker are technology partners.)

To enable parametrized input and consumable output, a standard method would be preferable, especially if tools existed to provide those capabilities. A general-purpose query language like SQL would be a suitable choice for input specification. Also, output in either a tabular form (such as a comma-separated values file, which is always a favorite) or a structured form (such as JSON, which is becoming a favorite) would be consumable by almost any other component.

While maintaining SQL queries in my source repository was one option to manage the input specification, this process also required additional coding to automate. Fortunately, Looker provided a means to not only store and execute my SQL, but also generate SQL for Db2 Warehouse on Cloud. In addition, the SQL templates (called “looks”) could be saved, shared, and made available for external consumption as URLs, including output as both JSON and CSV files.

With a day of training, and some help from the resident Db2 Warehouse on Cloud expert at Looker, a distribution of the detected entities over time was clearly visible (see Figure 10). Quickly eliminating the false negatives through the user interface revealed the expected distribution for activity, but individual specific intervals were insufficient to adequately detect a person.

Figure 10. Chart of selected Watson Visual Recognition results

It was evident that the amount of data in the signal that was coming from the visual recognition algorithm contained too much uninteresting information (that is, entities in which I had no interest or which were almost always present). In addition, false positives and false negatives would adversely affect the calculation of daily activity — either missing the presence of a person in the kitchen (a false negative, most likely according to the data) or seeing a “person” when they were not present (a false positive).

The application was going to need to reduce the noise in the signal in order to provide a high-quality notification service. The arbiter of what was noise and what was signal would require some type of oracle to distinguish between the interesting and not. In many ways, it was as if a radio was simultaneously playing all the stations that it could pick up across the entire spectrum. A radio operator (oracle) was going to need to “tune” the application, and correspondingly have the opportunity to “name” the resulting “stations” for either personal preference or to tune in a new station.

It was time to take the next step and learn how to train the Watson Visual Recognition service with examples from my camera in my chosen location in my kitchen.


Context defines success

Embarking on the quest to train Watson introduced a myriad of issues, but the most important issue leads us to lesson number two: context defines success. Lesson number two begins with the simple premise that training Watson would require the definition of the “things” that are indicators of activity in the home, for example, the people and animals who live there. That initial set of { People, Animals } defines the context under which Watson will learn, or, in other words, what Watson can be trained to recognize in images.

Watson is able to see many things by using the default classification model. To understand human activity, the default set of things needed to be pruned to those things that I thought indicated the types of behavior I was seeking to analyze, such as the first occurrence of a person in the kitchen. Converting the digital signal of the camera into an event that indicates human activity in the kitchen required specifying some parameters up-front. Some parameters represented operational capabilities, such as the amount of time between detection of motion and the capture of an image. Other parameters bounded sensor sensitivity, such as the interval between image captures. And, finally, and again most importantly, some parameters defined the context with respect to the image classifications available (that is, what can Watson see?).

Context discovery: Learning by training

To continue to move forward without complete knowledge, a set of things on which Watson could be trained was required, starting with a simple set, { person, dog, cat }. To build a Watson visual recognition model, training images must be gathered for both negative examples (an empty kitchen) and positive examples (a person, dog, or cat [only] in kitchen). To collect and organize images into a training set, the user of the cognitive IoT system needed to select the exemplars (that is, a picture of a person only, no cat, no dog) from the captured images and to skip over pictures that contained more than one entity or were private in nature.

Figure 11. Initial user interface used to collect examples

The first step in the process was collecting the images from the Raspberry Pi. Each device has a 64 GB micro-SD card, but the images can be erased during software updates. Because images are potentially a privacy issue, a local-only retrieval service (FTP) was configured for the private LAN, and images were collected and stored in an on-premises, hierarchical repository that corresponded to the type of the highest scoring classification.

Initially, the set of classes for the kitchen was defined as { person, dog, cat }. However, during user-experience testing (see Figure 11), it was obvious that the class set would need to be larger and dynamically defined. By adding a create function, I could add specific family members to the labeling options, but that also introduced an element of chaos. After a few evenings of labeling images for the training set, I finally had a sufficient number, approximately 800 images across one negative and seven positive classes, to start building a custom classifier. (By the way, the documentation for the Watson Visual Recognition service suggests that you have at least 50 images per class.)

Providing context by building a model

The Watson Visual Recognition service requires compressed, structured payloads (such as .zip files) of images for each of the classes, with a limitation of 100 MB per class per API call, and an overall size limit of 256 MB. Additional constraints and guidelines must be adhered to as well, including the naming of classes, incorporating the negative class, and keeping to minimums and maximums for quantity and quality of images processed.
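One way to respect those size limits is to batch the per-class archives greedily before uploading; the 256 MB per-call limit matches the article, but the batching strategy is my own illustration:

```python
def batch_by_size(files, per_call_limit=256 * 1024 * 1024):
    """Greedily group (name, size) pairs into batches whose total size stays
    under the per-API-call limit, so each batch can become one upload."""
    batches, current, total = [], [], 0
    for name, size in files:
        if size > per_call_limit:
            raise ValueError(f"{name} alone exceeds the API limit")
        if total + size > per_call_limit and current:
            batches.append(current)
            current, total = [], 0
        current.append(name)
        total += size
    if current:
        batches.append(current)
    return batches

mb = 1024 * 1024
files = [("a.zip", 90 * mb), ("b.zip", 90 * mb), ("c.zip", 90 * mb), ("d.zip", 60 * mb)]
print(batch_by_size(files))  # → [['a.zip', 'b.zip'], ['c.zip', 'd.zip']]
```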

Now that I understood the API and had identified the limitations, I started to build a custom classifier by performing the steps individually through the Linux command line. I modified and extended my classifier until I produced the appropriate output. After a successful set of steps was completed, I added them into a growing script file, which was parametrized to automate the process.

The limited size of each training run (256 MB) and each class (100 MB) indicated that multiple training API calls would be required. By separating the process into distinct phases, such as building and then training the corresponding test set, I enabled snapshots to be defined at interim points in the overall process.

The training runs processed 50% plus one of the sample images and used the remaining 49%+ to test the predictions against known examples. Typically, a randomized selection from the whole set is used to split the population, and in alternative percentages (90/10); however, these techniques have not (yet) been implemented.
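A sketch of the "50% plus one" split, here with the randomized selection that the text mentions but that had not yet been implemented:

```python
import random

def split_examples(examples, seed=42):
    """Randomized '50% plus one' train/test split of the curated examples."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) // 2 + 1
    return shuffled[:cut], shuffled[cut:]

train, test = split_examples(list(range(100)))
print(len(train), len(test))  # → 51 49
```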

Figure 12. Confusion matrix for model quality assessment

Now that training was completed on half of the example set, the quality of the resulting model needed to be tested. Using the remaining examples from the curated set, the custom classifier was called repeatedly and the results generated were captured in a rank-ordered list of classes and scores in the range from zero (0) to one (1). If the highest scoring class matched the curated class (for example, “David”), then the prediction was deemed correct.
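Scoring those predictions reduces to comparing each curated class against the TOP1 result; a sketch with hypothetical results (the class names here are illustrative):

```python
from collections import Counter

def confusion_matrix(pairs):
    """Tally (curated class, predicted TOP1) pairs; diagonal entries count
    correct predictions."""
    return Counter(pairs)

def accuracy(pairs):
    matrix = confusion_matrix(pairs)
    correct = sum(n for (truth, predicted), n in matrix.items() if truth == predicted)
    return correct / sum(matrix.values())

# Hypothetical results: "David" is confused with "Ian" once.
pairs = [("David", "David"), ("David", "Ian"), ("Ian", "Ian"), ("Ian", "Ian")]
print(accuracy(pairs))  # → 0.75
```

The off-diagonal counts are what a confusion matrix like Figure 12 visualizes.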

The Age-At-Home logic and analysis requires only discriminating between person and not person, and the union of all person classes is clearly distinguished from its complement.

Releasing into the Wild: Observing its behavior in the field

When I introduced the custom classifier into the application context, I needed to specify the model identifier for each device in the API calls to Watson Visual Recognition service and also incorporate the results into the event construct. The custom classifier identifier is specified in the dashboard as a Linux application environment variable for each device. Both the default classifier and the custom classifier, when specified, are applied against the images.

The results of the custom classifier are labeled with the name of the classifier (for example, _roughfog__XXXXX). The results from the default classifiers are named either with their type from the entity hierarchy (such as “/vehicle/truck”) or “default” for the entity set that was defined in Watson Visual Recognition. Neither the hierarchy nor the entity set is published, and both results must be handled dynamically. The entities that are returned by the default classifier must be collected, cataloged, and enumerated in both hierarchical and flat name-space topologies. Entity presence must be quantified, and entities that were identified as noise must be removed from the result set.

The JSON-encoded event includes the time, model indicator, and the classifiers and scores returned. The highest scoring class across all classifiers is replicated as the primary class for the event.
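A sketch of that event assembly; the field names are my own guesses at a plausible schema, not the actual event format:

```python
import json
import time

def build_event(model_id, results):
    """Assemble the JSON event: timestamp, model indicator, all classifier
    results, and the single highest-scoring class replicated as 'class'."""
    best = max(
        (cls for clf in results for cls in clf["classes"]),
        key=lambda c: c["score"],
    )
    return json.dumps({
        "date": int(time.time()),
        "model": model_id,
        "results": results,
        "class": best["class"],
        "score": best["score"],
    })

results = [
    {"classifier_id": "default", "classes": [{"class": "stove", "score": 0.7}]},
    {"classifier_id": "roughfog_123", "classes": [{"class": "David", "score": 0.9}]},
]
event = json.loads(build_event("roughfog_123", results))
print(event["class"], event["score"])  # → David 0.9
```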

With the new model identifier provided as an environment variable for the Raspberry Pi container, and additional logic to incorporate the new custom classifier, new events were recorded into the NoSQL repositories for each device.

Figure 13 is a graphic for the installation of the kitchen camera, named “rough-fog.” The location of the camera is indicated in the right pane; the field-of-view (FOV) is indicated by the yellow area. The corresponding image on the left is the “average” picture for an empty kitchen, which was calculated from the Watson Visual Recognition training set for the custom classifier.

Figure 13. Average image and field-of-view (FOV) for kitchen camera

With a camera in the kitchen and a camera in the bathroom, and with custom classifiers that were trained to recognize family members and pets, the analysis of the signal could begin. In comparison to the results of the default classifiers, the results from the custom classifiers indicated a much more interesting pattern of daily activity.

In the kitchen, the weekly histogram of activity (see Figure 14) indicated the presence of my daughter, Hali, who was currently away at school. In addition, examination of individual events uncovered numerous examples of confusion between myself and my eldest son, Ian (he does not get up before 8 AM).

Figure 14. Chart of daily kitchen activity by residents (including pets)

For the bathroom installation, the weekly histogram appeared to indicate that my wife, Keli, was spending inordinate amounts of time there, especially versus the kitchen. In addition, it appeared that my daughter, Ellen, was teleporting from school during the day to use the bathroom at home. Obviously, additional improvements in the recognition accuracy were needed.

Finally, during the writing of this article, I added a third installation to watch the road outside my home (named “quiet-water”). I installed the Raspberry Pi and PlayStation 3 Eye camera in a 6″x6″ plastic electrical box, with a 1.5″ hole drilled in the cover; the camera was placed upside down, with the field-of-view reduced from the normal 75 degrees to 56 degrees, which provided a better long view (see Figure 15). The Motion package provides an option to invert the image, and that specification was added for the quiet-water installation.

Figure 15. Average picture and field-of-view (FOV) for road camera

This installation provides a view of the road below our home, which ends at our front gate and is shared by three other homes. Using the curation user experience developed for the interior locations, a custom model was built to recognize whatever might be seen. Again, the default classifier was initially deployed, and its results were similar to those from the kitchen and bathroom, with a preponderance of general-purpose entities that were of no interest. These entities made sense, such as “slope” or “mountainside,” since the full frame was being analyzed. However, the default classifier also identified some entities that were of interest, such as “vehicle.” Additionally, because the new Watson Visual Recognition default classifier includes a hierarchical entity specification, other interesting analysis was possible, for example, a histogram of the animals that were seen (see Figure 16). Obviously, unless I am living in a zoo, it is highly improbable that many of those animals were present (for example, “alligator”).

Figure 16. Chart of animals by time of day from Watson Visual Recognition results in Db2 Warehouse on Cloud

Training for the quiet-water (road) device ensued, and I encountered challenges in both finding images with only one vehicle in the frame and deciding on names for the entities seen. The images captured included a lot of vehicles; some were known, but many were unknown. Rather than attempt to define a taxonomy of all vehicle types and corresponding manufacturers, years, or models, I decided that the owner and type of car provided sufficient context.

Classes were quickly created for the cars I owned (martin_c70, martin_c30, martin_suburban) and those of my neighbor (neighbor_hybrid, neighbor_fordflex, neighbor_pickup). Classes for the postal (USPS), package (UPS), and propane delivery vehicles were also easily distinguished and defined. However, additional classes were more difficult to distinguish and define with respect to their context of analysis. Was it more important to define them as “unknown” or “whitepickup” or “car_at_night,” or should there be a hierarchy? Obviously, the process of distinguishing and defining new examples was going to be a continual process. And, I needed to use a shared repository with support for version control, forks, pushes, and pulls. And, as the taxonomy evolved, the associated models were also going to need to be controlled with respect to those definitions and examples.


What you don’t know you don’t know

After this long and winding road, I came to lesson number 3: you don’t know what you don’t know. The Watson Visual Recognition classification taxonomy of entities defines its own concept of success: any entity not in the taxonomy was a failure from the perspective of a service consumer, such as myself. In other words, there were entities that Watson Visual Recognition didn’t know it didn’t know. Similarly, my classifications of family members, pets, and vehicles were also limited by my experience (that is, sufficient pictures of person X or vehicle Y) and by the taxonomy that I had created, hence my need to dynamically add new custom classes as I saw new things. On-going management of the taxonomy, including version management across training, testing, and usage, was going to be required — and potentially very difficult.

This lesson leads to the first corollary: Curation affects outcome. While most developers are aware of the adage, “garbage in, garbage out” in computer programming, in the cognitive realm this adage is the rule. The method, apparatus, process, and adjudication of examples into “ground truth” directly affects the outcome. Making classification mistakes (such as no cat in picture) and lacking prior evidence (such as no pictures of cats) will directly affect the quality and the capabilities of the cognitive service.

In addition to the taxonomy of entities and their example data, there was a latent question of success in performing the primary function, which was notifying the children and grandchildren when things were amiss with the elders. The initial presumption of a normal pattern based on detecting a person had not (yet) been invalidated, but the near sensory overload of potential entities, combined with the questionable nature of the associated predictions, made me question my own ability to programmatically define appropriate notification conditions.

My wife provided a brief moment of clarity. When she inspected the Excel graphs of “persons” detected in the kitchen, she said, “Why don’t you just draw a line at 10 AM for Sunday; they should be up by then.” Obviously, human intelligence and contextual knowledge of activity should and will be included (at some point), but I inferred from her statement that the measure of success was appropriate notifications, not the ability to track a person by using image classification on IoT devices.
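Her rule is simple enough to state in code. The sketch below assumes hypothetical detection timestamps and a `should_notify` helper of my own invention; it only illustrates the “draw a line at 10 AM Sunday” heuristic, not the actual notification logic of the solution.

```python
from datetime import datetime

def should_notify(detections, now):
    """Return True when the 10 AM Sunday line has been crossed without
    any person detection earlier that day."""
    if now.weekday() != 6 or now.hour < 10:  # Monday=0 ... Sunday=6
        return False
    seen_today = [d for d in detections
                  if d.date() == now.date() and d <= now]
    return len(seen_today) == 0

# June 4, 2017 was a Sunday; one hypothetical detection at 7:30 AM.
detections = [datetime(2017, 6, 4, 7, 30)]
print(should_notify(detections, datetime(2017, 6, 4, 10, 5)))  # False: someone was up
print(should_notify([], datetime(2017, 6, 4, 10, 5)))          # True: notify the family
```

The point of the exercise is the asymmetry it exposes: the rule cares only about the presence of any person detection before a deadline, not about tracking who that person is, which is exactly the reframing of success my wife suggested.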

To wit, the second corollary: Optimize your objective functions top-down. In other words, focus collection, curation, and other processes based on utility and user expectations (that is, notification satisfaction). Subordinate components, such as activity analysis, entity classification, motion detection, and time intervals, can then be controlled experimentally without limitations from programmatic human prior knowledge (that is, the daily activity of a person is the relevant signal on which to act).

The third corollary follows directly from this experimental approach: Enable dynamic subsystem controls (such as how often to capture and process an image). These controls should enable differential operation according to the experiment that you are conducting and should also be recorded along with the experimental output. For example, if the image capture frequency is once per minute, the resulting time-based analysis of detected entities will depend on that frequency for any historical or future comparative analysis. Success needs to be calculated with respect to all those experimental controls.
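Recording controls alongside output can be as simple as bundling them into each observation record. The field names below are hypothetical, chosen only to illustrate the pattern of emitting one self-describing JSON line per observation.

```python
import json
from datetime import datetime, timezone

# Hypothetical experimental controls in force for this run.
controls = {
    "capture_interval_seconds": 60,   # capture once per minute
    "classifier_version": "roadcam-3",
    "confidence_threshold": 0.5,
}

def record_result(entities, controls):
    """Bundle a classification result with the controls that produced it,
    so later comparative analysis can normalize for them."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "controls": dict(controls),
        "entities": entities,
    }

record = record_result([{"class": "neighbor_pickup", "score": 0.83}], controls)
line = json.dumps(record)  # one JSON line per observation
```

With the capture interval and classifier version embedded in every record, a later analysis that changes either control can still compare runs on equal footing instead of silently mixing experiments.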

Finally, historical performance is no guarantee of future performance, and on-going analysis of any cognitive service’s performance should be of paramount importance.

Going real time

Now that basic operational capabilities for the MVP had been established, using Watson Visual Recognition's custom classifiers and historical analysis for conditional alerting, the solution was close to my aspiration of a low-latency, private, sensing-and-responding cognitive IoT application at the edge. While the latency of Watson Visual Recognition was tolerable at under 2 seconds, it was not fast enough for real-time response. In addition, the privacy issue remained: I needed to eliminate sending images to the IBM Cloud for classification.

As mentioned earlier, there was a new IoT device from nVidia, the Jetson, which included both an ARM processor, like the Raspberry Pi, and the latest nVidia GPU chip. The Jetson was binary-compatible with the Raspberry Pi, but also provided a means to run deep learning software by using the GPU. The next step would be to incorporate a Jetson into the Age-At-Home solution and provide an on-premises means to classify the images with even lower latency.

Thankfully, nVidia provided an open source package called DIGITS that provides a graphical user interface to train deep learning frameworks in a variety of domains, including image classification. In addition, DIGITS provides a means to both download and import the image classification models that were built by using the interface. This software package provided the perfect combination of both capabilities for near real-time inferencing at the edge by using the Jetson, in addition to long-running, batched training and testing of models by using the IBM Cloud bare-metal servers with multiple nVidia GPUs.

The nVidia DIGITS software supports a number of deep learning frameworks, including the U.C. Berkeley (BVLC) Caffe. Unfortunately, the open source software does not come with a default image classifier, so a foundational image classifier needed to be constructed. Fortunately, the open source community provides ImageNet, a collection of 1 million pictures separated into 1000 classifications, and example deep learning neural networks, such as AlexNet and GoogLeNet, that can learn to classify images.

I used nVidia DIGITS open source software in both cloud and on-premises systems. The cloud system builds models from a foundational corpus of both open-sourced and community-sourced exemplars (such as pictures of cats in kitchens). For more information, view my talk at the nVidia 2016 GTC.


Discovering that my cognitive IoT app needed to learn is the key takeaway from this PoC. From image classification to daily activity analysis, the breadth and depth of any training set is insufficient to reflect actual use. Training must necessarily include the users' assessment of success and failure.

The need to minimize up-front bias and build a data set of entities, attributes, and activities observed by noisy sensors, and then attenuate those sensors through machine learning based on user-experience feedback, was a novel pattern. This pattern would need to be repeated to build any successful consumer-facing app that used cognitive processing. That closed-loop feedback mechanism must be enabled for both specialization and in situ training for any cognitive service.

Further work needs to be performed in defining ontologies, establishing ground truth for those entities, attributes, values, and relationships, and curating that corpus for both accuracy and precision. In addition, provenance and associated supply-chain techniques must be employed to “curate the curators” and establish credentials and credibility for both human and cognitive agents.

In all these efforts, additional transparency for inputs, outputs, and performance needs to be established, both for successful consumption and for articulation of what is known and unknown by those cognitive services. All cognitive service consumers should be reluctant to select an agent that does not transparently divulge its basis for learning as well as its demonstrated prowess.

Finally, the user is the final arbiter of success of any application, and it is their definition of “what is what” that will dictate whether a cognitive agent provides value. And it is not only the apparent specialization for a family of people, their animals, and neighborhood vehicles that matters, but more importantly, the subsequent cognitive reasoning or inferences based on those underlying signals, such as sending a notification to a child or caregiver. A false positive (that is, a notification when all is okay) is going to annoy the recipient and the grandparents; a false negative (that is, no notification when all is not okay) is much worse.

Cognitive IoT application developers must collect these positive and negative ground-truth results into their training sets, including the deterministic and potentially random inputs, the outputs, and the ontology and ground truth on which the agent was trained. It is from this information that subsequent iterations of training can learn more and improve performance through more and better examples.
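The collection step can be as simple as appending each user-adjudicated outcome to a training set for the next iteration. The sketch below is a hypothetical illustration of that closed loop; the record fields and `record_feedback` helper are assumptions, not part of the actual solution.

```python
# A minimal sketch of the closed feedback loop: each notification decision
# is later adjudicated by the user, and the adjudicated example (inputs,
# prediction, and verdict) is folded back into the training set.
training_set = []

def record_feedback(inputs, predicted, user_verdict):
    """Store a user-adjudicated example for the next training iteration.

    user_verdict is True when the decision was appropriate (a justified
    notification, or a correctly suppressed one) and False otherwise."""
    training_set.append({
        "inputs": inputs,
        "predicted": predicted,
        "correct": user_verdict,
    })

# A false positive: a notification was sent, but all was okay.
record_feedback({"entity": "person", "hour": 9}, predicted="notify",
                user_verdict=False)

false_positives = [t for t in training_set
                   if t["predicted"] == "notify" and not t["correct"]]
```

Because false positives and false negatives carry different costs for the family, keeping the verdict alongside the prediction lets the next training iteration weight those mistakes differently instead of treating all errors alike.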