This repository has been archived by the owner on Jul 22, 2024. It is now read-only.
Rich Hagarty edited this page Feb 1, 2018 · 9 revisions

Short Name

Clickstream analysis using Apache Spark and Apache Kafka

Short Description

Demonstrate how to detect real-time trending topics on popular web sites by collecting data on user visits.

Offering Type

Cognitive and Data Analytics

Introduction

Clickstream analysis is the process of collecting, analyzing, and reporting about which web pages a user visits, and can offer useful information about the usage characteristics of a website. In this Code Pattern, we will utilize clickstream analysis to demonstrate how to detect real-time trending topics on the Wikipedia web site.

Author

by Prashant Sharma and Rich Hagarty

Code

Demo

N/A

Video

Overview

Clickstream analysis is the process of collecting, analyzing, and reporting about which web pages a user visits, and can offer useful information about the usage characteristics of a website.

Some popular use cases for clickstream analysis include:

  • A/B Testing - Statistically study how users of a web site are affected by changes from version A to B.

  • Recommendation generation on shopping portals - The click patterns of shoppers on a portal indicate how they were influenced into buying something. This information can be used to generate recommendations for future users who exhibit similar click patterns.

  • Targeted advertisement - Similar to recommendation generation, but tracking user clicks across websites and using that information to target advertisement in real-time and more accurately.

  • Trending topics - Clickstream data can be used to study or report trending topics in real time. For a given time window, display the items that receive the highest number of user clicks.
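As a minimal illustration of the trending-topics idea (independent of Spark or Kafka), the sketch below buckets click events into fixed time windows and reports the most-clicked pages per window. The window size, function name, and event data are all invented for this example.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 60  # the "time quantum"; value chosen for illustration


def top_pages(events, k=2):
    """Group (timestamp, page) click events into fixed windows and
    return the k most-clicked pages for each window."""
    windows = defaultdict(Counter)
    for ts, page in events:
        windows[ts // WINDOW_SECONDS][page] += 1
    return {w: counts.most_common(k) for w, counts in sorted(windows.items())}


# Simulated clickstream: (unix_timestamp, page_title) pairs
clicks = [
    (0, "Apache_Spark"), (5, "Apache_Kafka"), (10, "Apache_Spark"),
    (65, "Wikipedia"), (70, "Wikipedia"), (75, "Apache_Spark"),
]
print(top_pages(clicks))
# → {0: [('Apache_Spark', 2), ('Apache_Kafka', 1)],
#    1: [('Wikipedia', 2), ('Apache_Spark', 1)]}
```

In the actual Code Pattern this windowed count is expressed as a streaming aggregation in Spark rather than a batch function, but the grouping logic is the same.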

In this Code Pattern, we will demonstrate how to detect real-time trending topics on the Wikipedia web site. To perform this task, Apache Kafka will be used as a message queue, and the Apache Spark structured streaming engine will be used to perform the analytics. This combination is well known for its usability, high throughput, and low latency.

When you complete this Code Pattern, you will understand how to:

  • Use Jupyter Notebooks to load, visualize, and analyze data
  • Run Notebooks in IBM Data Science Experience
  • Perform clickstream analysis using Apache Spark Structured Streaming
  • Build a low-latency processing stream utilizing Apache Kafka

Flow

  1. User connects with Apache Kafka service and sets up a running instance of a clickstream.
  2. Run a Jupyter Notebook in IBM's Data Science Experience that interacts with the underlying Apache Spark service. Alternatively, this can be done locally by running the Spark Shell.
  3. The Spark service reads and processes data from the Kafka service.
  4. Processed Kafka data is relayed back to the user via the Jupyter Notebook (or console sink if running locally).
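The four steps above can be sketched end-to-end. Since a real Kafka broker and Spark service are outside the scope of this page, the sketch below substitutes a `queue.Queue` for the Kafka topic and a plain aggregation loop for the Spark job; all names and data are illustrative only.

```python
import queue
import threading
from collections import Counter

clickstream = queue.Queue()  # stand-in for the Kafka topic


def produce_clicks():
    # Step 1: a running clickstream instance publishes page-click events
    for page in ["Apache_Spark", "Apache_Kafka", "Apache_Spark", "Wikipedia"]:
        clickstream.put(page)
    clickstream.put(None)  # sentinel: end of stream


def process_stream():
    # Steps 2-3: the "Spark service" reads and aggregates the events
    counts = Counter()
    while (page := clickstream.get()) is not None:
        counts[page] += 1
    return counts


threading.Thread(target=produce_clicks).start()
result = process_stream()
# Step 4: relay the processed data back to the user (console sink)
print(result.most_common())
```

In the real pattern, the producer is the Wikipedia clickstream feed publishing to Kafka, and the consumer is a Spark Structured Streaming query whose results surface in the notebook.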

Included Components

  • IBM Data Science Experience: Analyze data using RStudio, Jupyter, and Python in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.
  • Apache Spark: An open-source distributed computing framework that allows you to perform large-scale data processing.
  • Apache Kafka: Kafka is used for building real-time data pipelines and streaming apps. It is designed to be horizontally scalable, fault-tolerant and fast.
  • Jupyter Notebook: An open source web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text.
  • Message Hub: A scalable, high-throughput message bus. Wire micro-services together using open protocols.

Featured technologies

  • Cloud: Accessing computer and information technology resources through the Internet.
  • Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.

Links

  • Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns
  • AI and Data Code Pattern Playlist: Bookmark our playlist with all of our Code Pattern videos
  • Data Science Experience: Master the art of data science with IBM's Data Science Experience
  • Spark on IBM Cloud: Need a Spark cluster? Create up to 30 Spark executors on IBM Cloud with our Spark service

Blog

https://ibm.ent.box.com/notes/269287758953