
Implications of training machine learning models from machine-generated data and human-authored data

An IT operations environment generates a lot of data, such as logs, metrics, traces, and incident tickets. Some of it is human-authored and some is machine-generated. Logs, metrics, and traces are machine-generated data. The content in incident tickets is typically, if not always, human-authored. Other data from the software development lifecycle, such as code comments, deployment descriptions, and collaboration chat conversations, is all human-authored.

Any analytical solution that monitors this data to identify anomalies, resolve issues quickly, and proactively prevent issues from happening needs to be able to process and learn patterns from all of these types of data. In IBM Cloud Pak for Watson AIOps, we have built machine learning models to perform these tasks of anomaly detection, problem diagnosis, and issue avoidance.

While building these models, we noted that each type of data requires a different kind of treatment and preparation to teach machine learning models. In this article, we review in detail the properties, complexities, and implications of training machine learning models from machine-generated data as compared to human-authored data. While the motivation for the article came from our past work on building natural language understanding and speech recognition services and from our present work on IT operations data, we have chronicled human-authored and machine-generated data more broadly to draw more wide-ranging implications.

Data that machine learning models can learn from

Artificial Intelligence (AI) models need data to learn from. This data comes in many forms, especially when it is generated within enterprises. For example, enterprise data could be in PDF documents, spreadsheets, word processing documents, blogs, plain text, conference call audio and video recordings, emails, transaction records, application and system logs, metrics, and IT incident and case ticket data, just to name a few.

Some of this data is human-authored, while the rest is machine-generated. There is also a third category that we call machine-recorded human-action-generated data. All of this data can be structured, semi-structured, or unstructured. Let’s start by defining these terms.

First, let’s define the types of data:

  • Human-authored: Human-authored or human-generated content is data that humans write or generate. Emails, word documents, blogs, chat conversations in collaboration tools, audio files, videos, and images are examples of human-generated data. Enterprises have large amounts of human-generated data that has not been leveraged much because this data is unstructured, multimodal (for example, text, images, video, or audio), and hard to process automatically.

  • Machine-generated: Machine-generated data is data that is generated by computer programs, devices, or other mechanisms without being triggered by active human action or intervention. However, humans may have designed what, when, where, and how this data gets generated. IT application and system logs, sensor device logs, satellite data, weather data, and IT system metrics are all examples of machine-generated data that is not caused by a direct human action.

  • Machine-recorded, human-action-generated data: If software programs are used in surveillance settings where human actions are recorded, can we say that data is machine-generated? We prefer to categorize that type of data as machine-recorded human-action-generated data: data that is observed and recorded by machines but caused by human actions. For example, surveillance data recorded by various devices, customer call records captured by phone companies, and recordings made by online recording software are all machine-recorded human-action-generated data.

Next, let’s define the data formats or structure:

  • Structured data: Structured data is data that is formatted and has a well-defined schema. This data can be generated either by machines or humans. For example, database records are structured data.

  • Unstructured data: Unstructured data is data that doesn’t have a well-defined format or structure. It could be in any modality such as text, audio, or video in any language. For example, chat conversations in collaboration tools like Slack or Teams are unstructured.

  • Semi-structured data: Semi-structured data is data that has some structure, but the structure is not as well defined. For example, the data within structured fields can be free-form text, URLs, audio, video, or images. Incident tickets are semi-structured data: there is a definite template for the incident ticket, whereas the content inside the template can be unstructured (a small example follows this list).
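To make the semi-structured case concrete, here is a small, hypothetical incident ticket represented as a Python dictionary: the named fields form the structured template, while the description field holds free-form, human-authored text. The field names and values are illustrative only.

    # A small, hypothetical incident ticket: the named fields are the structured
    # template, while "description" carries unstructured, human-authored text.
    incident_ticket = {
        "ticket_id": "INC0012345",           # structured field
        "severity": 2,                       # structured field
        "opened_at": "2021-03-14T09:26:53",  # structured field
        "assignment_group": "payments-sre",  # structured field
        "description": (                     # unstructured, free-form text
            "Customers report intermittent 500 errors on checkout. "
            "Started after last night's deployment; suspect the new gateway config."
        ),
    }

    # Structured fields can be queried directly; the description needs NLP.
    print(incident_ticket["severity"], len(incident_ticket["description"].split()))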

Each type of data requires a different kind of treatment and preparation to teach machine learning models. In this article, we compare and contrast the properties of data that is generated in companies and provide qualitative guidance on what to look for when building machine learning models with human-generated data and machine-generated data.

Examples of human-authored and machine-generated data

Before we discuss the properties of human-authored and machine-generated data, and the associated implications of building machine learning models with them, we present some motivating examples for each kind of data. Both human-generated and machine-generated data can be structured, semi-structured, or unstructured, so we present examples according to these categories. In the following table, we show some examples of enterprise content along these two dimensions.

Structured

  Human-authored data:

  • Input data provided by users in web forms or on paper (for example, personal data such as name, age, gender, salary, address, or preferences)
  • Customer feedback data filled out in surveys

  Machine-generated data:

  • Sensor data from medical devices, factories (for example, the temperature on the job-shop floor), and household gadgets (for example, a thermostat)
  • Location data
  • RFID tag data
  • GPS system data
  • Satellite and telemetry data
  • Software application metrics, transaction records, or user profile records
  • Software code written by programs

  Machine-recorded human-action-generated data:

  • Web clickstream data that gets stored in structured databases
  • ECG, X-ray, and CT scan data
  • Robotic Process Automation (RPA) scripts that machines generate by observing human actions such as clicks

Semi-structured

  Human-authored data:

  • Incident tickets
  • Root cause analysis reports for IT outage incidents
  • Responses filled out by humans in surveys for specific questions, such as “what can be improved in the process?”
  • Tables with unstructured content in the cells
  • Spreadsheets with unstructured content in the cells
  • Software code written by humans

  Machine-generated data:

  • Software application or device logs
  • HTML, XML, or JSON formats generated by software programs

  Machine-recorded human-action-generated data:

  • Incident ticket data created by humans that is archived in a storage device
  • Any surveys that humans fill out that get stored for further analysis, either manually or automatically

Unstructured

  Human-authored data:

  • Emails, text, audio, video, images, documents, or chat conversation logs

  Machine-generated data:

  • Responses from virtual assistants in natural language, such as text or speech (for example, Amazon Alexa, Apple’s Siri, or Google Home responding with questions or clarifications)

  Machine-recorded human-action-generated data:

  • Surveillance data in video, image, or audio formats, including live audio, video, and image feeds

Properties of human-authored and machine-generated data

Now, let’s compare and contrast the various properties of human-authored and machine-generated data and discuss the implications of building machine learning models with these types of data.

We’ll discuss these properties: topic range, authorship & style, formats, modality, volume of data, language complexity, languages, and language mix.

Topic range

Human-authored data: Large, uncontrolled. Humans can write text and generate images, audio content, and videos on any topic or combination of topics. For example, topics on social media platforms, news articles, or TV programming.

Machine-generated data: Currently, mostly controlled and scoped. Machine-generated data is typically generated for a specific purpose and on specific topics designed by humans, for example, IT system logs. In the near future, the scope of this data will reach the breadth of human-authored data. With advancements in machine learning (ML) techniques such as Generative Adversarial Networks (GANs) and natural language generation (NLG), machines are capable of generating images and natural language on a wide range of topics. If these systems are put to use, they are likely to generate data on as many topics as humans do.

Machine-recorded human-action-generated data: Generally the same as machine-generated data.

Implications on machine learning models: Processing and classifying human-authored content is currently much more complex than processing machine-generated data because of its large and uncontrolled scope.

Machine learning models processing human-authored data have to use techniques such as topic modeling and clustering to detect topics.

However, machine-generated content is growing increasingly complex. With GAN-based image generation and natural language generation (including prose and poetry generation) from keywords, the lines will increasingly blur over time, making processing machine-generated content just as complex.

Authorship & style

Human-authored data: Multiple people, different styles. Every individual’s writing style is different. Sometimes, multiple authors contribute to content with differing styles and even different formats.

Machine-generated data: Controlled and scoped. Machine-generated content is mostly designed by humans and executed by computer systems or devices in the designed formats and styles. As AI becomes prominent in more use cases, the number of entities that can author a document can change. With GANs and AI models for injecting personality, emotions, and communication tones into text, audio, images, and videos, machines are capable of generating content in the style of specific authors or of creating styles of their own. We are already starting to see this in concepts such as digital twins, AI news anchors, and holographic virtual assistants, all of which are being modeled with specific personalities or styles. They might even adapt their styles based on the situation, just like humans.

Machine-recorded human-action-generated data: Generally the same as machine-generated data.

Implications on machine learning models: Depending on the specific use case, the writing style of the author could be of special interest, for example, when analyzing crime, detecting plagiarism, or attributing authorship to anonymous books. Given the wide variety, analyzing the styles of human-authored content is a challenging problem that requires sophisticated pattern matching.

Machine-generated content has less variety and is typically designed by engineers. However, with AI/ML-generated text (prose and poetry) and images, this is changing fast, again blurring the lines between human-authored and machine-generated content.

Formats

Human-authored data: Free-form or template-based. Human-authored data can be both free-form and template-based, depending on the context.

Machine-generated data: Mostly template-based. In IT operations environments, data could be template-based, or it could be semi-structured, wherein there is free-form text within a template-based field. As AI becomes prominent in more use cases, the format of the generated data can change. In the natural language processing domain, AI is starting to define its own text generation formats (for example, prose, poetry, or story writing). In the computer vision domain, AI/ML is able to create foreground and background, thereby adding to auto-generated formatting styles.

Machine-recorded human-action-generated data: Generally the same as machine-generated data.

Implications on machine learning models: Human-authored free-form content is much harder to process than machine-generated, template-based formats.

However, as natural language generation (NLG) matures and AI is able to generate natural language in free-form formats, this difficulty gap may blur in the domains where AI plays a role.

Modality

Human-authored data: Multi-modal: speech, text, images, or videos. Human-authored data can be in multiple modalities:

  • Speech might include different languages, accents, dialects, language mixing, or industry and domain vocabulary.
  • Images might come with or without captions.
  • Text might include tables, graphs, diagrams, and images and might or might not follow specific pre-defined formats.

Machine-generated data: Multi-modal: speech, text, images, or videos. Machine-generated data (especially when it is machine-recorded human-action-generated data) can also be in multiple modalities:

  • Synthesized speech: synthetic voices over ML-generated content (we are increasingly seeing synthetic voices in news and infomercials).
  • Images: images of observed human actions or synthetic images created using Generative Adversarial Networks (GANs).
  • Text: machine-generated text produced using natural language summarization or natural language generation techniques.

Machine-recorded human-action-generated data: Generally the same as machine-generated data.

Implications on machine learning models: Human-authored data is much more complex than machine-generated data, mostly due to the variety of data, styles, accents, dialects, graphs, and tables involved. While not yet there, machine-generated data is becoming increasingly complex with generative AI techniques.

Volume of data

Human-authored data: Medium. The volume of human-authored content has been growing rapidly due to the availability of social media platforms.

Machine-generated data: High, and increasing rapidly. The number of devices, computers, phones, and software programs that can generate data is growing exponentially. For example, IT systems and applications alone can write terabytes of data each day, depending on their size and scope.

Machine-recorded human-action-generated data: Generally the same as machine-generated data.

Implications on machine learning models: The large volume and high rate of growth of machine-generated data poses storage, archiving, and processing challenges in addition to the challenges in preparing that data in an efficient manner for training machine learning models. For example, data might have to be parallel-processed using Spark-like technologies.
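As a concrete illustration, here is a minimal sketch of that kind of parallel pre-processing with PySpark; the file path, the severity keywords, and the session settings are placeholders you would adapt to your own environment.

    # A minimal PySpark sketch for pre-processing a large volume of machine-generated
    # logs in parallel before they are used to train a model. The path and the
    # severity keywords are illustrative placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, regexp_extract

    spark = SparkSession.builder.appName("log-preprocessing").getOrCreate()

    # Read raw log lines; Spark distributes the files across executors.
    logs = spark.read.text("hdfs:///apps/payments/logs/*.log")

    # Extract a coarse severity field and keep only error lines as training candidates.
    parsed = logs.withColumn(
        "severity", regexp_extract(col("value"), r"\b(INFO|WARN|ERROR)\b", 1))
    errors = parsed.filter(col("severity") == "ERROR")

    print("Candidate error lines:", errors.count())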

Language complexity

Human-authored data: Highly complex. Human-authored content tends to be highly complex, with aspects such as personality, sentiment, humor, emotions, communication tones, or sarcasm. Also, language can be highly contextual and cultural.

Machine-generated data: Not as complex as human language. Most machine-generated content is designed for specific purposes and tends to be devoid of personality, humor, or emotions. However, with AI-model-generated images, videos, audio, and text that is generated from keywords, topics, seed images, and so on, the notion of nuanced complexity is changing fast. Watson Debater is an example of argument generation (with humor included). AI-based news readers are another example where a pre-programmed personality is injected into machine-generated data. As machine understanding of human language nuances increases, we can expect AI systems to mimic, recreate, and even invent their own mannerisms, brands of humor, sarcasm, and similar nuances.

Machine-recorded human-action-generated data: Generally the same as machine-generated data.

Implications on machine learning models: Detecting the personality, emotions, sentiments, intents, attitudes, humor, sarcasm, and communication tones of humans is a complex problem. Several machine learning models exist to model these human attributes, including the IBM Watson services in the Natural Language Understanding (NLU) portfolio. As with any AI model, they are not foolproof.

While use cases in the IT domain do not currently look for personality in machine-generated data, we are already seeing personality injection in machine-generated content in use cases such as companion robots and conversational AI for therapy, training, and persuasion.

Languages

Human-authored data: Wide variety. Human-authored data comes in as many languages as humans speak. However, companies typically do business in roughly 175-200 official languages. While some artifacts, such as code, are standardized in English, speech, image captions, and video data can be in any language.

Machine-generated data: Not as many languages as human-generated data. Machine-generated data is typically standardized to a few major languages, with English as the default. Also, the range and complexity of machine-generated data is much simpler than that of human-authored data. Theoretically, machine-generated data can be in as many languages as the humans who designed the machines speak, but such variety is not commonly seen.

Machine-recorded human-action-generated data: Generally the same as machine-generated data.

Implications on machine learning models: Understanding human language is hard even for humans. Idioms, humor, and sarcasm are very context and culture specific, yet we expect machines to do as well as (or better than!) humans in 170+ languages. Foundational AI services such as speech-to-text, text-to-speech, natural language understanding, natural language generation, and visual image recognition have to deal with many languages. If machine learning models are used to process languages, these models require a significant amount of training data, which is costly. If rules are used, where applicable, non-trivial subject matter expert time is required to define and refine the rules. In either case, it is a highly complex problem.

Since machine-generated data is likely to be captured in relatively few languages, and the domain is reasonably scoped (to whatever the device or IT system is generating data for), language enablement can be a more contained problem with machine-generated data.

Language mix

Human-authored data: Yes. It is common for human-authored content to mix languages. In particular, English words, phrases, and sentences may be sprinkled throughout text in other languages.

Machine-generated data: Yes, but not as complex as human-generated data. Sometimes, machine-generated data can be in mixed languages. For example, IT application logs could be in a combination of Spanish and English. It is possible for logs, messages, or incident ticket content to be in mixed languages, with English words, phrases, and sentences sprinkled throughout other languages.

Machine-recorded human-action-generated data: Generally the same as machine-generated data.

Implications on machine learning models: When languages are mixed, either in human-authored data or machine-generated data, language detection becomes necessary: one must detect the language reliably before processing the content and its intent. Even so, processing human-authored data with mixed languages can be much more difficult than processing machine-generated data in mixed languages. For example, humans tend to concoct whole new words and grammar by mixing languages, thereby compounding the problem of language understanding. There is less of a chance of that with machine-generated data.
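As a small illustration, the following sketch runs language detection on incoming ticket text before routing it to language-specific processing. It assumes the open source langdetect package, and the confidence threshold and fallback are hypothetical choices.

    # A minimal sketch, assuming the open source langdetect package, of routing
    # mixed-language ticket text to language-specific processing. The threshold
    # and the fallback language are illustrative placeholders.
    from langdetect import detect_langs

    def route_by_language(text, default_lang="en", min_prob=0.80):
        """Return the dominant language code, falling back to a default."""
        try:
            candidates = detect_langs(text)  # e.g. [es:0.71, en:0.29]
        except Exception:
            return default_lang
        best = candidates[0]
        return best.lang if best.prob >= min_prob else default_lang

    ticket = "El servicio de pagos devuelve HTTP 500 errors intermittently"
    print(route_by_language(ticket))  # mixed text lowers confidence, hence the fallback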

Teaching machine learning models from human-authored data versus machine-generated data

Based on these properties of human-authored and machine-generated data, let’s now discuss the implications for machine learning models that process these two types of data.

But first, let’s define some terms that we use in explaining machine learning models.

  • Base models: A base model is what an AI service vendor typically prepares and makes available as a service in the general domain. For example, general-purpose AI services like sentiment analyzers, speech-to-text services, and image recognition services are typically trained using domain-independent data. These base models are trained with data from multiple, diverse datasets (publicly available or licensed data sources) to ensure broad coverage. The advantages that these base models offer are that they can get developers or companies started on any dataset and they provide average-to-good accuracy depending on the type of data. Accuracy of these services usually ranges from, say, 75% to 85%, give or take 5% to 10%.
  • Industry or customized models: While base models are necessary, they are often insufficient to get the job done when tested on specific domains. For example, a speech-to-text service that is trained on broadcast news might not work well on speech samples from the banking and insurance domains because there aren’t enough occurrences of specialized words and their pronunciations in the base training data. So, essentially, the models need to be adapted to the domains that they are expected to work on, which means training a domain-specific model. Industry-specific or customized models are those that are typically built on top of base models. The idea of custom models is that by building on top of a good base model and training with data from a specific domain or industry, one can achieve higher levels of accuracy with smaller amounts of training data. The incremental amount of data required to teach the model a specific domain or industry vocabulary tends to be smaller than what it took to build the original base model. This is what we call riding the faster learning curve. (A minimal fine-tuning sketch follows this list.)
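To make the idea of customizing a base model concrete, here is a minimal sketch, assuming the Hugging Face transformers library and a tiny, made-up labeled domain dataset; the checkpoint name, labels, example texts, and training settings are illustrative only, not a production recipe.

    # A minimal sketch of adapting a general-purpose base model to a specific domain,
    # assuming the Hugging Face transformers library. The checkpoint, labels, and
    # example texts are illustrative placeholders.
    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)  # base model with a fresh classification head

    # A tiny, hypothetical labeled domain dataset (IT incident severity).
    texts = ["payment gateway timed out", "disk utilization within normal limits"]
    labels = [1, 0]
    encodings = tokenizer(texts, truncation=True, padding=True)

    class DomainDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings, self.labels = encodings, labels
        def __getitem__(self, idx):
            item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item
        def __len__(self):
            return len(self.labels)

    # Fine-tuning rides the "faster learning curve": far less labeled data is needed
    # than was used to pre-train the base model.
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="domain-model", num_train_epochs=3),
        train_dataset=DomainDataset(encodings, labels),
    )
    trainer.train()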

Applicability and relevance of general-purpose base models

Human-authored data: Applicable and relevant. For automatic speech recognition (ASR), image recognition, and natural language understanding types of tasks, base models trained on general-purpose or publicly available data are necessary, if not sufficient, for domain-specific use cases.

For example, a speech-to-text system trained on several thousand hours of audio data from general news broadcasting or spoken chit-chat forms the basis for all domain-specific models.

While their accuracy may not be optimal for specific domains, industries, use cases, accents, or dialects, they work out of the box and perform reasonably well (roughly 65-70% accuracy on average to start with). These base models can then be further customized with industry-specific data to finish the last-mile accuracy problem.

Machine-generated data: Not as applicable or relevant in many cases as for human-authored data. Depending on the type of device or software program, the data that is generated and its vocabulary can vary widely. Often, public data is unavailable or, even if it is available, it won’t be applicable to individual scenario-based tasks. Therefore, base models at the device or machine level are generally not applicable for processing machine-generated data.

Machine learning models have to be trained in each scenario with specific domain or use case data. For example, you can’t train an anomaly prediction model on the IT application logs of one application in one company and expect it to work for another application, whether within the same company or in another company.

However, at lower levels of granularity, base models can be built ahead of time and would be applicable. For example, entity extraction models for extracting entities such as TimeStamps, IPAddresses, PodIds, NodeIds, or JavaErrorStackTraces can be general purpose and can be prebuilt.

Machine-recorded human-action-generated data: Generally the same as machine-generated data.

Pre-trained language models for word embeddings and features

Human-authored data: Can be pre-trained on human-generated language data to improve model accuracy. Grammar, syntax, semantics, and rules of human languages are well documented and tend to be stable, although popular culture and vocabulary evolve and are susceptible to change. Therefore, pre-processing and pre-building language models to understand human language in all modalities is possible. Language models such as BERT are examples of such pre-trained language models. Various experiments done by researchers show that using word embeddings generated from language models as features improves the accuracy of prediction models.

Machine-generated data: Can be pre-trained on domain-specific data, but the influence on accuracy tends to be smaller than for language models trained on human-generated data. One can pre-train language models with machine-generated data as well. However, AI model accuracy depends more on the specific patterns of data that arrive in a specific context than on vocabulary. Therefore, the impact of IT domain-specific pre-trained language models on accuracy tends to be smaller. For example, IT application logs exhibit different patterns depending on the type of application, seasonality, and usage patterns. These factors influence accuracy more than the vocabulary, which tends to be controlled. Our limited experimental results indicate this. We need to do more broad-based comparisons of the rate of accuracy improvement in use cases where language-model-based pre-trained features are used with human-authored data versus use cases involving machine-generated data.

Machine-recorded human-action-generated data: Generally the same as machine-generated data.
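As an illustration of pre-training embedding features on machine-generated data, the following is a minimal sketch, assuming the gensim library; the toy log lines, whitespace tokenization, and parameters are placeholders for a real log corpus and pipeline.

    # A minimal sketch, assuming the gensim library, of pre-training word embeddings
    # on tokenized log lines so they can later be used as features for a prediction
    # model. The toy corpus and parameters are illustrative placeholders.
    import numpy as np
    from gensim.models import Word2Vec

    log_lines = [
        "ERROR connection timeout to db-primary after 30s",
        "WARN retrying connection to db-primary",
        "INFO request completed in 120 ms",
    ]
    tokenized = [line.lower().split() for line in log_lines]

    # Train small embeddings over the log vocabulary.
    emb = Word2Vec(sentences=tokenized, vector_size=50, window=3, min_count=1, epochs=20)

    # Represent a new log line as the average of its token vectors (a simple feature).
    def embed(line):
        vecs = [emb.wv[t] for t in line.lower().split() if t in emb.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(emb.wv.vector_size)

    features = embed("ERROR connection timeout to db-primary")
    print(features.shape)  # (50,)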

Customizing pre-trained language models

Human-authored data: Can be customized. Yields improvements in accuracy. Further customizing pre-trained language models that were trained on natural human language (for example, Wikipedia or news) is advisable. The last layers of a deep learning model pre-trained on human language can be retrained with domain-specific data. For example, one can envision pre-trained speech models for the fast-food restaurant domain, the insurance domain, and the IT domain.

Machine-generated data: Can be customized. Yields only slight improvements in accuracy. For each company, data and vocabulary tend to have a personality of their own. Therefore, further customizing pre-trained models is likely to improve the accuracy of the machine learning models only marginally. We say marginally because the vocabulary scope of machine-generated data is generally much more limited than that of human language. We have yet to conduct experiments to prove this hypothesis.

Machine-recorded human-action-generated data: Generally the same as machine-generated data.

Entity Extraction

Human-authored data: High complexity. Requires different approaches, including rules, depending on the amount of annotated data available to train models. Extracting entities is an important step in language understanding. For example, extracting the names of people, places, organizations, and so on is a critical ingredient for understanding intent. Entity extraction is a complex problem in natural language. It requires grammar parsing or large amounts of annotated data (or both) to train machine learning models. Extracting entities from specialized domains such as healthcare, insurance, legal, and contracts can add further semantic-understanding complexities.

Machine-generated data: Medium to high complexity, depending on whether metadata is available. Some (if not all) aspects of entity extraction on machine data can be less complex. Since machine-generated data tends to be designed by humans, for the most part, metadata can be obtained for a given data type and format, and one can simply extract entities from that metadata.

For example, IT application logs, metrics, incident tickets, and so on tend to have some standard format from which metadata can be extracted. If not the exact location of entities, at least the larger context and topic can be obtained from the metadata. Entities would still need to be detected within the metadata fields if the metadata doesn’t capture them explicitly.

Machine-recorded human-action-generated data: Generally the same as machine-generated data.
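As an illustration of the rule-driven side of entity extraction on machine-generated data, here is a minimal sketch that pulls timestamps and IP addresses out of a log line with regular expressions; the patterns are illustrative, not exhaustive.

    # A minimal sketch of rule-based entity extraction from a machine-generated log
    # line. The regular expressions cover only ISO-style timestamps and IPv4
    # addresses and are illustrative rather than exhaustive.
    import re

    PATTERNS = {
        "timestamp": re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"),
        "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    }

    def extract_entities(line):
        """Return a dict of entity type -> list of matches found in the line."""
        return {name: pattern.findall(line) for name, pattern in PATTERNS.items()}

    log_line = "2021-03-14T09:26:53 ERROR gateway 10.0.12.7 connection refused"
    print(extract_entities(log_line))
    # {'timestamp': ['2021-03-14T09:26:53'], 'ip_address': ['10.0.12.7']}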

Summarization

Human-authored data: High complexity. Summarizing human-authored content is a complex problem. Human-authored content can contain mixed sentiments, pros and cons, comparisons and contrasts, metaphors, idioms, irony, humor, satire, and more. Machine learning models need to recognize these concepts, if not fully understand them, to get to the core point of a given piece of content.

Machine-generated data: Requires reinterpretation based on the data type. Machine-generated data tends to be more direct and to the point. Most practical uses of machine-generated data summarization are in service of finding deviations from desired behavior (for example, anomaly detection), finding opportunities for improvement (for example, process summarization, process mining, and process analysis), or finding patterns similar to a given machine-generated data pattern so as to retrieve the actions taken in a specific setting to resolve an issue (for example, similar incident extraction and next-best-action derivation).

For example, in the IT operations domain, applications, infrastructure, and network devices produce a lot of log data. The main purpose of emitting those logs is to help detect and debug problems. Summarization of log data in this setting can be treated as an anomaly detection problem. Summarization of automatically generated incident ticket data can be interpreted as next-best-action derivation to resolve a given incident.

Machine-recorded human-action-generated data: Generally the same as machine-generated data.
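As a small illustration of treating log "summarization" as anomaly detection, here is a minimal sketch that flags rare log templates by frequency; the digit-masking rule and the rarity threshold are simplistic placeholders for real log template-mining techniques.

    # A minimal sketch of "summarizing" logs as anomaly detection: mask out the
    # variable parts of each line to get a rough template, then flag templates that
    # occur rarely. The masking rule and threshold are simplistic placeholders.
    import re
    from collections import Counter

    def template(line):
        return re.sub(r"\d+", "<NUM>", line)  # mask numbers to form a rough template

    logs = [
        "request 1041 completed in 120 ms",
        "request 1042 completed in 98 ms",
        "request 1043 completed in 101 ms",
        "disk failure on node 7",
    ]

    counts = Counter(template(l) for l in logs)
    rare = [t for t, c in counts.items() if c == 1]  # "rare" = seen once in this window
    print(rare)  # ['disk failure on node <NUM>']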

Conclusion and next steps

Many consumer and business applications have to process both human-authored and machine-generated data. Many business use cases have a mix of human-authored and machine-generated data. This holds true for IT operations data as well. Therefore, it is important to understand the properties of each kind of data and be prepared to deal with each accordingly.

Generally speaking, human-authored data is more complex than machine-generated data for automated processing and for machine learning models to deal with. In this article, we examined the various properties of human-authored and machine-generated data and discussed the complexities involved in processing both.

Based on our experience and empirical analysis, it is possible to pre-train features using language models with both human-generated data and machine-generated data, but it may not be possible to pre-train machine learning models in use cases involving machine-generated data the way it is with human-generated data. Let’s examine these two statements more closely.

  • It is possible to pre-train features and models in use cases involving human-generated data because, in any given language, human-generated content has a common base, be it written natural language, speech, videos, or images.

    This data is widely and publicly (if not always freely) available in sources such as Wikipedia and social media platforms. Using this data, it is possible not only to pre-train features (such as word embeddings) but also to pre-train machine learning models that can be considered base or preliminary models. These are machine learning models that work out of the box for specific prediction tasks, with accuracies ranging from roughly 60-70% when given sufficient labeled data for training. For example, we can build an entity recognition model or a speech recognition model that works out of the box in any use case setting today by training features with language models such as BERT or fastText on Wikipedia or news media data and pre-training the models with labeled data from the same or similar sources. The remaining accuracy gaps can be filled by customizing the machine learning models with use-case-specific data. For example, a speech recognition model that is trained to recognize general spoken English by a native English speaker may not do well in a healthcare-specific setting where the doctor is of Indian descent and speaks English with an Indian accent. The speech-to-text model needs to be trained with some additional data that contains these properties for it to improve its accuracy. Even so, the base model trained on general-purpose audio data could get you more than halfway there.

  • Even though it is possible to pre-train features in use cases involving machine-generated data, it may not be possible to pre-train machine learning models in those use cases because machine-generated data doesn’t have a vast common base like human language.

    Every machine, application, hardware component, or device type has its own rhythm and patterns based on how it is set up and configured, and those patterns are unique to a specific environment. They can only be learned by training on data from that specific environment and might not be transferable to other settings. That is why pre-building base models might not be helpful in use cases with machine-generated data. In fact, pre-training models could even be detrimental, confusing the model with irrelevant data. So, in a sense, all models are custom models. On the other hand, pre-training features may be okay because the general vocabulary of that machine domain can be learned via word embeddings. Also, it is possible to build some reusable base models, for example, entity extraction models for extracting common entities such as timestamps, IP addresses, pod identifiers, and node identifiers.

While the wide topic range, unrestricted authoring styles, free-form formats, multiple modalities, number of languages, and language nuances in human-authored data make it more complex to process and to train machine learning models with, the sheer volume of data generated by machines adds its own complexities in the data processing, training, and continuous learning stages. The information in this article is a point-in-time observation based on the current state of the art. As natural language generation and image generation techniques (such as GANs) catch on and become more prominent, machine-generated data will likely become just as complex as human-generated data, blurring the lines.

Acknowledgements

We would like to thank Laura Chiticariu, Senior Technical Staff Member, Chief Architect for Natural Language Processing at IBM’s Watson division for reviewing this article and providing constructive comments.