Knowledge-driven AI features in a self-service analytics platform
Discover and describe the characteristics of your data with semantic concepts and relationships
In today’s world, enterprises and organizations face the growing challenge of sifting through massive amounts of data to find hidden patterns that improve business performance and drive competitive advantage. Artificial Intelligence (AI) is therefore becoming increasingly important, because it allows computers and software to help people become faster and smarter at the tasks they perform.
IBM Cognos Analytics (11.1) is a state-of-the-art, self-service analytics platform. It introduces many AI-infused features that help you quickly discover hidden insights, get visualization recommendations, and converse in natural language. In this article, I’ll walk through some of the main AI-infused features to help you understand what’s happening behind the scenes.
Knowledge discovery explained
Knowledge Discovery Service (KDS) is the cornerstone of the AI-infused features. The brain of KDS is the Knowledge Discovery Engine (KDE), which comprises classifiers, classification pipelines, and ontologies. It discovers and describes the characteristics of your data with semantic concepts and relationships. The semantic concepts can be domain concepts, which describe your business domain, or numeric concepts, which describe the nature of the data, its distribution, and its quality. KDE also automatically clusters columns into logical groups and discovers hierarchies among these groups.
These knowledge discoveries are used broadly by all AI-infused features and heavily impact their outcomes. The features include Visualization recommender, Interesting fields recommender, Influencer recommender, Related Visualization, and Auto-join recommender.
Take uploading a file as an example. The engine analyzes the dataset based on its metadata and data, such as column labels, cell values, statistics, data distribution, and data quality. Next, each column is classified with multiple semantic concepts that represent various characteristics of the column. Although these concepts are not yet directly exposed to the user in the UI, they are used to define the properties or icon of the column.
Take the CA sample data California_Zip_Website_Visits.xlsx as an example. The Country, State, City, and Zip Code column labels provide lexical hints to the Knowledge Discovery Engine (KDE), so these columns are tagged with Geography-related concepts. That seems obvious; however, the label Lat Lng doesn’t mean anything by itself. This is where the KDE classifier shows its power. It takes input from multiple layers in the classification pipeline and consolidates them in the final decision. In this case, the values in the Lat Lng column hint at a coordinate data pattern. Combining that hint with the column’s data distribution, data type, and so on, KDE classifies the column with the concept Coordinate, along with other concepts related to its numeric characteristics.
As another example, if the Latitude column were filled with String values, the column would not be tagged with the Latitude concept, regardless of the lexical hints provided by its label.
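The way label hints and data hints are consolidated can be illustrated with a toy classifier. This is a minimal sketch, not KDE's actual pipeline: the `LABEL_HINTS` table, the coordinate pattern, the function names, and the assumption that a Lat Lng cell holds a "lat, lng" string are all hypothetical choices made for illustration.

```python
import re

# Hypothetical lexical hints: label keywords mapped to concepts (illustrative only).
LABEL_HINTS = {
    "country": "Geography", "state": "Geography", "city": "Geography",
    "zip": "Geography", "latitude": "Latitude",
}

# Assumed cell format for a coordinate pair, for example "34.05, -118.24".
COORD_PATTERN = re.compile(r"^\s*-?\d{1,2}(\.\d+)?\s*,\s*-?\d{1,3}(\.\d+)?\s*$")


def is_number(text):
    """Return True if the cell text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def classify_column(label, values):
    """Combine a label layer and a data layer into a set of concepts."""
    concepts = set()

    # Layer 1: lexical hints from the words in the label.
    for word in re.split(r"[\s_\-]+", label.lower()):
        if word in LABEL_HINTS:
            concepts.add(LABEL_HINTS[word])

    # Layer 2: data pattern hints from the cell values.
    if values and all(COORD_PATTERN.match(v) for v in values):
        concepts.add("Coordinate")

    # Consolidation: a Latitude label is trusted only when the data agrees.
    if "Latitude" in concepts:
        if not all(is_number(v) and -90 <= float(v) <= 90 for v in values):
            concepts.discard("Latitude")

    return concepts
```

With this sketch, a column labeled Lat Lng is tagged Coordinate purely from its values, while a column labeled Latitude that holds text such as "north" ends up with no Latitude tag, mirroring the two cases described above.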
The other columns, such as Web site Visits and Blog Visits, are tagged with Measure based on their data type and data distribution. A hierarchy is automatically formed as Country -> State -> City.
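One plausible signal for a hierarchy such as Country -> State -> City is that every value at the lower level belongs to exactly one value at the level above it. The sketch below checks that property for a pair of columns. It is an assumption about how such a relationship could be detected, not a description of KDE's algorithm, and the function name `determines` is hypothetical.

```python
def determines(child, parent):
    """True if every child value maps to exactly one parent value."""
    seen = {}
    for c, p in zip(child, parent):
        # setdefault records the first parent seen for this child value.
        if seen.setdefault(c, p) != p:
            return False
    return True


# Made-up rows for illustration.
city = ["Los Angeles", "San Diego", "Sacramento", "Portland"]
state = ["California", "California", "California", "Oregon"]
country = ["United States"] * 4

city_rolls_up_to_state = determines(city, state)
state_rolls_up_to_city = determines(state, city)
```

Here each city belongs to one state, so City can sit below State, while the reverse check fails because California maps to several cities.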
As you can see from these examples, data quality and tidiness are critical to the quality of knowledge discovery and affect all the other AI-infused recommendations. Later, I will describe some common problems with messy data that you should avoid when preparing your data.
Let’s have a look at the Properties of the Latitude column. Because the concept Latitude is an attribute of the concept Geography, Geographic location and Latitude are set as the value of the Represents property.
In the Properties of the Lat Lng column, because its concept Coordinate is an attribute of the concept Geography, Geographic location and Position are set as the value of the Represents property.
Discover your data
Knowledge Discovery Service has two modes: deep and shallow. Why does knowledge discovery need both? Let’s take a close look at the key differences between them.
In deep mode, univariate analysis (min, max, mean, stddev, median, quartiles, and so on) and bivariate (pairwise) analysis are conducted first to capture the data characteristics, and KDE then uses the results to classify concepts and relationships more accurately. However, deep mode takes more time, especially for data sources that contain many tables, such as a schema or an FM package.
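To give a feel for the kind of statistics involved, here is a minimal sketch of a univariate profile and one pairwise measure, using only the Python standard library. The function names are hypothetical, and Pearson correlation is my choice of example for a bivariate analysis; the article does not specify which pairwise measures deep mode computes.

```python
import math
import statistics


def univariate_profile(values):
    """Summary statistics for one numeric column (needs at least two values)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
    }


def pearson(xs, ys):
    """Pearson correlation between two equally long, non-constant columns."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Running the pairwise step over every pair of columns is what makes this kind of analysis expensive on sources with many tables, which is consistent with the cost of deep mode described above.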
In shallow mode, knowledge discovery happens on the fly based on metadata only, so it is much faster than deep mode. It is triggered by user gestures in many places, such as adding a new calculation, modifying a column name, or dragging columns onto the dashboard canvas. These are ad hoc, interactive analyses that require a fast response. In shallow mode, knowledge discovery makes a best attempt, but because it lacks the discovered data characteristics, some of the knowledge it discovers won’t be as accurate as in deep mode. For example, the Lat Lng column would not be tagged with the concept Coordinate without knowledge of its data characteristics.
Because an FM package usually involves a vast number of tables, knowledge discovery on an FM package uses shallow mode by default so that the package can be imported faster. Deep mode knowledge discovery can be triggered using “
In IBM Cognos Analytics 11.1, OLAP packages, such as PowerCube and DMR, support only shallow mode. When does deep mode knowledge discovery happen? It happens during:
- “Upload file”, “Refresh file”, or “Append file”
- “Load” or “Refresh” the data of a data set
- “Enrich package”
- “Load metadata” on a schema of a data server
The knowledge discovered from datasets is used broadly by all AI-infused features and drives their outcomes. I’ll describe some of the AI-infused features at a high level to give you an idea of how the knowledge is used in the product.
Interesting fields recommender
Interesting fields recommender suggests a set of fields that are more noteworthy than the rest of the fields in a given analysis context, which helps the user start an analysis faster and guides the user along a discovery path. One of the main criteria for selecting an interesting field is the set of concepts tagged to the column, and only a subset of concepts is considered.
Influencer recommender
Influencer recommender suggests a set of fields that have possible causal relationships with a given target field, using the knowledge discovered, such as concepts, relationships, groups, and statistical analysis results. It not only introduces a novel feature-reduction approach for statistical analysis, which allows IBM Cognos Analytics to perform advanced analytics at interactive speed, but also enables the system to discover causal associations in the data and avoid suggesting nonsensical relationships based purely on statistical analysis. Several advanced analytics visualizations in the product use Influencer recommender.
Auto-join recommender
When a user loads an Excel file with multiple worksheets or a zip file that contains multiple CSV files, the uploaded sheets or files are analyzed by Knowledge Discovery Service, and the semantic and data knowledge of each column is discovered. Using this knowledge, Auto-join recommender infers deeper insights about the data and automatically detects join relationships among the different files. This enables the user to start creating a dashboard or report right away, without having to go through the tedious steps of manually joining the files one by one.
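One simple signal for a join relationship is that two columns from different files share most of their distinct values. The sketch below scores column pairs by that overlap. It is a deliberately simplified illustration under my own assumptions: the function name `join_candidates`, the input shape, and the 0.8 threshold are hypothetical, and the real recommender also draws on the semantic knowledge described above rather than on value overlap alone.

```python
def join_candidates(tables, min_overlap=0.8):
    """Suggest column pairs from different tables whose values overlap heavily.

    `tables` maps a table name to {column name: list of values}.
    Returns (left table, left column, right table, right column, overlap)
    tuples, highest overlap first.
    """
    names = sorted(tables)
    suggestions = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            for lcol, lvals in tables[left].items():
                for rcol, rvals in tables[right].items():
                    lset, rset = set(lvals), set(rvals)
                    if not lset or not rset:
                        continue
                    # Share of the smaller column's distinct values found in the other.
                    overlap = len(lset & rset) / min(len(lset), len(rset))
                    if overlap >= min_overlap:
                        suggestions.append((left, lcol, right, rcol, overlap))
    return sorted(suggestions, key=lambda s: s[4], reverse=True)


# Made-up sheets for illustration.
sheets = {
    "orders": {"customer_id": [1, 2, 2, 3], "amount": [10.5, 20.0, 7.25, 99.0]},
    "customers": {"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]},
}
candidates = join_candidates(sheets)
```

In this made-up data, only the pair customers.id and orders.customer_id clears the threshold. Value overlap alone can produce false positives, for example between two unrelated small-integer columns, which is one reason semantic concepts matter for this feature.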
Intent-based modeling
Many enterprise databases and data warehouses have a huge number of tables. Building a proper model for analyzing such data requires the business user to have a deep understanding of those tables, formal modeling training, and long hours of practice. Combined with the power of Auto-join recommender, the Intent-based modeling feature allows a user to provide an analysis intent phrase in natural language and promptly proposes a model that satisfies that intent. This can relieve the user of the lengthy and complex process of traditional data modeling.
Avoid common problems with messy data
As you’ve seen in the sections above, the knowledge discovered from your data drives the outcome of the recommendations from AI-infused features. Data quality and tidiness are critical because they affect the quality of knowledge discovery and of the other AI-infused recommendations.
While you are preparing your data, you should try to avoid the following common problems with non-tidy datasets that will affect the outcome of knowledge discovery:
- Mixing multiple granularities in one measure column. For example:
- a. Month and Year values are mixed in a “Date” column.
- b. A “Location/Division” column mixes three levels of location: City, State, and Country.
A mixed-granularity measure requires special techniques when you build reports and dashboards; otherwise, it is very easy to fall into a double- or triple-counting problem. The data patterns also provide ambiguous hints: Year or Month? City, State, or Country?
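The counting problem is easy to reproduce with a few made-up rows. In the sketch below, the explicit level column is my addition for illustration; in a messy table that information is usually missing, which is exactly what makes the data ambiguous.

```python
# Made-up rows: (location, level, visits). The State and Country rows are
# subtotals that already include the rows beneath them.
rows = [
    ("Los Angeles", "City", 120),
    ("San Diego", "City", 80),
    ("California", "State", 200),
    ("United States", "Country", 200),
]

# Summing the whole column counts every visit three times.
naive_total = sum(visits for _, _, visits in rows)

# Summing a single level of granularity gives the real total.
correct_total = sum(visits for _, level, visits in rows if level == "City")

print(naive_total, correct_total)  # prints: 600 200
```

A tidy version of this data keeps one level of granularity per table, or at least records the level explicitly so that a report can filter on it.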
- Column labels that are too long, too descriptive, or formula-like. For example, this column label has 53 words:
“Slide the circle to rate the value of an online chat with an instructor feature. (0 = not valuable and 10 = very valuable) hover here for a definition An online chat feature allows you to contact the instructor using real time chat (like instant messaging) to access a course instructor or facilitator.”
Column labels provide lexical hints to KDE, which are vital for classifying the correct domain concept, especially in shallow mode, where no data hints are available. However, a paragraph-like label takes much longer to analyze, and no meaningful lexical hints can be inferred from it.
- Compound words in all lower case or all upper case. For example, “fiscalYearBudget”, or the same words split by an underscore or a dash, provides lexical hints to the system, but “fiscalyearbudget” is treated as a single word with no semantic meaning.
- Anomalous rows that break the data pattern. For example, a timestamp value appears in the Closing inventory column, which is supposed to have only an integer data pattern.
- Confusing data formats. For example, Customer ID is formatted with a thousands separator.
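To make the compound-word problem from the list above concrete, here is a minimal sketch of how a label might be split into word tokens. The function name `label_tokens` and the splitting rules are hypothetical and only illustrate why case changes and separators matter; KDE's actual tokenizer is not described in this article.

```python
import re


def label_tokens(label):
    """Split a column label into lower-case word tokens."""
    tokens = []
    # Break at underscores, dashes, and whitespace first.
    for part in re.split(r"[_\-\s]+", label):
        # Then pull out camelCase words, all-caps runs, and digit runs.
        tokens.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part))
    return [t.lower() for t in tokens]


print(label_tokens("fiscalYearBudget"))    # prints: ['fiscal', 'year', 'budget']
print(label_tokens("fiscal_year_budget"))  # prints: ['fiscal', 'year', 'budget']
print(label_tokens("fiscalyearbudget"))    # prints: ['fiscalyearbudget']
```

Without a case change or a separator, nothing marks where one word ends and the next begins, so the all-lower-case and all-upper-case labels each collapse into a single token that matches no known word.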
I won’t enumerate all the problems here. From the examples above, you should already have a clear idea of what non-tidy data looks like and can work to fix these issues during data preparation.