Determine trending topics with clickstream analysis  

Use Apache Spark and Kafka to collect, analyze, and report on website visitor data

Last updated | By Prashant Sharma, Rich Hagarty

Description

Clickstream analysis is the process of collecting, analyzing, and reporting on which web pages a user visits, and it can offer useful information about the usage characteristics of a website. In this code pattern, we will use clickstream analysis to demonstrate how to detect real-time trending topics on the Wikipedia website.

Overview

Clickstream analysis is the process of collecting, analyzing, and reporting on which web pages a user visits, and it can offer useful information about the usage characteristics of a website.

Some popular use cases for clickstream analysis include:

  • A/B testing – Statistically study how users of a website are affected by changes from version A to version B.
  • Recommendation generation on shopping portals – Click patterns of users on a shopping portal indicate how a user was influenced into buying something. These patterns can then drive recommendations for future users who click in similar ways.
  • Targeted advertisement – Similar to recommendation generation, but tracking user clicks across websites and using that information to target advertisements more accurately and in real time.
  • Trending topics – Clickstream analysis can be used to study or report trending topics in real time. For a given time window, display the items that receive the highest number of user clicks.
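The trending-topics idea in the last bullet can be sketched in plain Python, independent of any streaming engine: bucket click events into fixed (tumbling) time windows and count clicks per page within each window. The click data, window size, and function name below are illustrative, not part of the code pattern itself.

```python
from collections import Counter, defaultdict

def trending_topics(clicks, window_seconds, top_n=3):
    """Group (timestamp, page) click events into fixed time windows
    and return the most-clicked pages per window."""
    windows = defaultdict(Counter)
    for ts, page in clicks:
        bucket = ts - (ts % window_seconds)  # start of the tumbling window
        windows[bucket][page] += 1
    return {start: counts.most_common(top_n)
            for start, counts in sorted(windows.items())}

# Hypothetical clickstream: (unix_timestamp, page_title)
clicks = [
    (0, "Apache_Spark"), (5, "Apache_Kafka"), (12, "Apache_Spark"),
    (65, "Wikipedia"), (70, "Wikipedia"), (75, "Apache_Spark"),
]
print(trending_topics(clicks, window_seconds=60, top_n=2))
```

In the first 60-second window, `Apache_Spark` leads with two clicks; in the second, `Wikipedia` does. A streaming engine such as Spark performs the same windowed count continuously over unbounded input rather than over a finished list.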

In this code pattern, we will demonstrate how to detect real-time trending topics on Wikipedia. To perform this task, Apache Kafka will be used as a message queue, and the Apache Spark structured streaming engine will be used to perform the analytics. This combination is well known for its usability, high throughput, and low-latency characteristics.

When you complete this pattern, you will understand how to:

  • Use Jupyter Notebooks to load, visualize, and analyze data.
  • Run Jupyter Notebooks in IBM Watson Studio.
  • Perform clickstream analysis using Apache Spark structured streaming.
  • Build a low-latency processing stream utilizing Apache Kafka.

Flow

  1. User connects with Apache Kafka service and sets up a running instance of a clickstream.
  2. Run a Jupyter Notebook in IBM Watson Studio that interacts with the underlying Apache Spark service. Alternatively, this can be done locally by running the Spark Shell.
  3. The Apache Spark service reads and processes data from the Apache Kafka service.
  4. Processed Kafka data is relayed back to the user via the Jupyter Notebook (or console sink if running locally).
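Steps 3 and 4 of the flow can be sketched with Spark structured streaming's Kafka source and console sink. This is a minimal sketch, assuming a Kafka broker at `localhost:9092` and a topic named `clicks` whose message values are page titles; both names are placeholders, and the job must be launched with the Spark-Kafka connector package (e.g. via `spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark-version>`).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("ClickstreamTrending").getOrCreate()

# Step 3: Spark reads the clickstream from Kafka.
clicks = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
          .option("subscribe", "clicks")                        # placeholder topic
          .load()
          .selectExpr("CAST(value AS STRING) AS page", "timestamp"))

# Count clicks per page over 60-second tumbling windows.
trending = (clicks
            .groupBy(window(col("timestamp"), "60 seconds"), col("page"))
            .count()
            .orderBy(col("count").desc()))

# Step 4: relay results to the console sink when running locally.
query = (trending.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```

When run through the notebook instead, the same streaming DataFrame is queried from notebook cells rather than printed to a console.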

Related Blogs

Live analytics with an event store fed from Java and analyzed in Jupyter Notebook

Event-driven analytics requires a data management system that can scale to allow a high rate of incoming events while optimizing to allow immediate analytics. IBM Db2 Event Store extends Apache Spark to provide accelerated queries and lightning fast inserts. This code pattern is a simple introduction to get you started with event-driven analytics. You can...


Creating an augmented reality résumé using Core ML and Watson Visual Recognition

Overview In June 2017, at the Apple Worldwide Developers Conference (WWDC), Apple announced that ARKit would be available in iOS 11. To highlight how IBM’s Watson services can be used with Apple’s ARKit, I created a code pattern that matches a person’s face using Watson Visual Recognition and Core ML. The app then retrieves information...


Related Links

Playlist

Playlist with all of our Code Pattern videos

Apache Spark

Spark on IBM Cloud: Need a Spark cluster? Create up to 30 Spark executors on IBM Cloud with our Spark service