This project explores how Microsoft Fabric and Azure OpenAI can analyze a document repository of text-based data. Microsoft Fabric offers a suite of data analytics tooling. It also offers OneLake, an automatically provisioned enterprise data lake that can store any type of file, including unstructured text data. Azure OpenAI, combined with strategic prompting, extracts valuable information from text efficiently. By combining Microsoft Fabric and Azure OpenAI you will be able to analyze your text-based data like never before, particularly with Spark Notebooks and Power BI! Our document store for this walkthrough will be public domain eBooks.
This project explores the following capabilities of Azure OpenAI, using Microsoft Fabric.
- Entity Extraction
- Text Summarization
- Text Classification
- Text Embeddings and Semantic Similarity
Here is a link to the video submission. This project should take about 1 hour to complete; follow the project steps below to get started.
Note: This project leverages data from Project Gutenberg, the first free provider of public domain eBooks. Please consider donating at https://www.gutenberg.org/donate/
Many organizations have a treasure trove of text and PDF documents. These document stores can be massive though, and searching through them manually would be far too time-intensive. The tools used in this project make it much more efficient to understand your data and how it connects. Entity extraction eliminates the need to manually sift through documents for metadata, text summarization lets you understand a document much more quickly, and text classification groups your data into meaningful categories. For example, a lawyer can use such tooling to find previous court cases that are similar to the one they are working on, without having to manually search through countless documents!
- An Azure subscription
- Contributor access to a Microsoft Fabric workspace
- Access to a Microsoft Fabric F64 capacity or higher (the Fabric trial FT1 SKU will not work for direct Azure OpenAI integration). This capacity should be connected to your workspace.
Note: an alternative approach would be to use a Microsoft Fabric SKU of any size together with a provisioned Azure OpenAI Service.
- Build Lakehouse
Go to (or create) the Fabric workspace that will be used for this project, select the 'Data Science' or 'Data Engineering' experience, and create your Lakehouse. Feel free to name it whatever you like!
Then go back to the workspace. The result should look something like this:
- Ingest eBook Data onto OneLake
Download or clone this repo to access the Jupyter Notebooks in the scripts folder, then import the 01_data_ingestion_and_prep notebook into your Fabric workspace.
If you do not see the Import Notebook option, make sure you are in the Data Science or Data Engineering experience.
Open the notebook item, then click Add Lakehouse. Select 'Existing Lakehouse', then choose the Lakehouse you just created. This will make your Lakehouse the default for this notebook.
Run each cell in the notebook and follow along with the markdown. Notice how quickly the Spark pool starts! You can change some of the parameters, but the recommended values are already set. This notebook will create the necessary folders, ingest the data from Project Gutenberg, and then prepare the data for use with Azure OpenAI by splitting it into chunks with Semantic Kernel, more specifically its text chunker.
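To give a feel for what chunking does, here is a minimal sketch in plain Python. It is not Semantic Kernel's text chunker and not the notebook's code: the function name `chunk_text`, the `max_words` parameter, and the use of word counts as a rough stand-in for tokens are all assumptions made for illustration.

```python
from __future__ import annotations


def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_words words.

    Hypothetical helper: word counts only approximate model tokens.
    """
    chunks: list[str] = []
    current: list[str] = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        # Break an oversized paragraph into pieces that fit the budget.
        pieces = [words[i:i + max_words] for i in range(0, len(words), max_words)]
        for piece in pieces:
            # Start a new chunk when adding this piece would exceed the budget.
            if current and len(current) + len(piece) > max_words:
                chunks.append(" ".join(current))
                current = []
            current.extend(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Packing whole paragraphs together where possible keeps related sentences in the same chunk, which tends to help the later summarization and embedding steps.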
After running the script, if you go back to the workspace and open up your Lakehouse, it should look like the following (if it doesn't, try hitting the refresh button in the top left).
You can explore the data using the Lakehouse explorer
- Enrich eBook Data using Azure OpenAI
Import the 02_enrich_data_with_AzureOpenAI notebook using the same process as before (including setting the default Lakehouse)
This notebook will access the Azure OpenAI resource directly from within Microsoft Fabric. When using an F64 SKU or higher, notice how you do not need an API key or a provisioned service in Azure! The use of Azure OpenAI is charged against the capacity units on your F64 capacity. AMAZING! With this lightweight, yet extremely powerful, use of Azure OpenAI we will perform the following:
- Entity Extraction
- Text Summarization
- Text Classification
- Generate Embeddings
Run each cell in the notebook, examining how each function uses Azure OpenAI prompting.
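As a rough illustration of what an entity extraction prompt can look like, here is a small sketch. The field names, the prompt wording, and both helper functions are hypothetical and are not taken from the notebook. The call to the model itself is deliberately left out, so only the prompt building and the response parsing are shown.

```python
from __future__ import annotations

import json

# Assumed fields for illustration; the notebook may extract different ones.
ENTITY_FIELDS = ("title", "author")


def build_entity_prompt(text: str, fields: tuple[str, ...] = ENTITY_FIELDS) -> str:
    """Build a prompt asking the model to reply with a JSON object."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from the text below and reply with "
        f"a single JSON object with exactly these keys: {field_list}. "
        "Use null for any field you cannot find.\n\n"
        f"Text:\n{text}"
    )


def parse_entities(response_text: str, fields: tuple[str, ...] = ENTITY_FIELDS) -> dict:
    """Parse the model's reply, keeping only the expected keys."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        data = {}  # Reply was not valid JSON; fall back to empty values.
    if not isinstance(data, dict):
        data = {}
    return {field: data.get(field) for field in fields}
```

Asking for JSON and then parsing defensively means a malformed reply yields empty fields instead of stopping the whole run.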
The entire text of a book would be too large to fit into the token window for these models. That is why we use the chunks we created before, together with a text reduction technique: we summarize each of the smaller chunks, then combine all of those summaries to produce a summary of the entire document.
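The summarize-then-combine idea can be sketched as follows. This is an illustrative outline, not the notebook's code: `summarize_document` is a hypothetical name, and `summarize` stands in for whatever function sends a summarization prompt to Azure OpenAI and returns the reply.

```python
from __future__ import annotations

from typing import Callable


def summarize_document(chunks: list[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk, then summarize the combined chunk summaries."""
    if not chunks:
        return ""
    # First pass: one summary per chunk, each small enough for the model.
    partial_summaries = [summarize(chunk) for chunk in chunks]
    if len(partial_summaries) == 1:
        return partial_summaries[0]
    # Second pass: condense the chunk summaries into one document summary.
    return summarize("\n".join(partial_summaries))
```

For a very long book the combined chunk summaries could themselves exceed the model's limit. In that case the second pass would need to be repeated over groups of summaries, which this sketch does not do.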
We then use the summary as the input for our classification prompt, using the predefined categories in the script.
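A classification prompt of this kind might look like the sketch below. The category names here are placeholders, not the ones predefined in the script, and both helper functions are hypothetical. The model call is again left out.

```python
from __future__ import annotations

# Placeholder categories; the script defines its own list.
CATEGORIES = ("Fiction", "History", "Science", "Other")


def build_classification_prompt(summary: str, categories: tuple[str, ...] = CATEGORIES) -> str:
    """Ask the model to pick exactly one category for a summary."""
    options = ", ".join(categories)
    return (
        f"Classify the following summary into exactly one of these categories: {options}. "
        "Reply with the category name only.\n\n"
        f"Summary:\n{summary}"
    )


def normalize_category(
    response_text: str,
    categories: tuple[str, ...] = CATEGORIES,
    fallback: str = "Other",
) -> str:
    """Map the model's reply onto a known category, ignoring case and stray punctuation."""
    cleaned = response_text.strip().strip(".\"'").lower()
    for category in categories:
        if category.lower() == cleaned:
            return category
    return fallback  # Reply did not match any known category.
```

Normalizing the reply keeps the category column clean, so the category chart in Power BI does not end up with near-duplicate labels such as 'History' and 'history.'.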
For embeddings we use a technique similar to the one used for summarization. Embeddings are numeric representations of the semantic meaning of text. Here we get the embedding of each chunk, then take the average of all the embeddings. Put simply, we are aiming for the average meaning of the entire text. There are more advanced techniques that weight chunks differently, but here every chunk gets equal weight.
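Averaging chunk embeddings with equal weights can be sketched in a few lines of plain Python. The function name `average_embedding` is hypothetical, and the notebook may well use a numeric library instead of lists.

```python
from __future__ import annotations


def average_embedding(embeddings: list[list[float]]) -> list[float]:
    """Return the element-wise mean of equally weighted chunk embeddings."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(vector) != dim for vector in embeddings):
        raise ValueError("all embeddings must have the same length")
    count = len(embeddings)
    # Average each dimension across all chunk embeddings.
    return [sum(vector[i] for vector in embeddings) / count for i in range(dim)]
```

For example, averaging `[1.0, 2.0]` and `[3.0, 4.0]` gives `[2.0, 3.0]`. With real ada-002 embeddings each vector would have 1536 entries instead of two.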
All enriched data is saved back to JSON for future use. Data is also saved as a Lakehouse table to be analyzed with notebooks, SQL, and Power BI! Your Lakehouse should now look something like:
- Analyze Enriched Data using Notebooks and Power BI
Import the 03_TSNE_data_analysis notebook using the same process as step 2 (including setting the default Lakehouse)
Run the notebook. This notebook uses the embeddings we generated to show how semantically similar the books are to one another, based on their cosine similarity. OpenAI ada-002 embeddings have 1536 dimensions, which is far too many for humans to visualize. TSNE reduces them to two dimensions, which gives us a good, human-friendly estimate of how similar these embeddings are. Here is an example from my most recent run:
Already we can see some clusters forming among these books! The notebook now saves the x and y coordinates from this visual and updates the books table with them. This way we can continue our analysis with Power BI, which is our next step!
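The cosine similarity used in this step measures the angle between two embedding vectors. Here is a minimal sketch in plain Python; the function name is hypothetical and the notebook's own implementation may differ. The TSNE projection itself is not reproduced here.

```python
from __future__ import annotations

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Two books whose averaged embeddings point in nearly the same direction score close to 1, which is why they tend to land near each other in the scatter plot.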
Go to the Lakehouse, then open up the SQL analytics endpoint
Here you can do a variety of analytics. You can write SQL queries, generate visual queries, manage the default semantic data model, and create a new report. The report we will create comes from the default semantic data model of the Lakehouse. It leverages Direct Lake mode, which means there are no extra steps we need to take to start using Power BI on top of the data lake!
Click new report
This will open up a report connected to the books data we just created
Here is the report I created but feel free to get creative and make your own! To understand how many books were assigned to each category I made a clustered column chart with 'category' on the x-axis and 'Count of book_id' on the y-axis (by clicking the arrow next to book_id in the y-axis you can change the aggregation metric). On the right I made a scatter chart with 'book_id' in the values, 'Sum of x_axis' in the x-axis and 'Sum of y_axis' in the y-axis and 'category' in the legend (I also added zoom sliders from the formatting pane). Then on the bottom I provided the book details in a table visual.
Click 'File' in the top left and save the Power BI report to your workspace, naming it whatever you like. Now it's time to start using it! Power BI is very interactive: clicking on any visual cross-filters the others.
In the scatter chart you can box highlight any of the groupings to see what OpenAI thought were similar texts.
In a single Power BI report we can see all of the work we did with Azure OpenAI. Entity extraction gives us valuable metadata such as book title, author, and more. Text summarization shows us the summary in the book details. Text classification is shown in our category bar chart. Semantic similarity is shown through the TSNE visualization. Through this we have turned unstructured text data into meaningful insights!
This concludes the project, thank you for your time. I hope you are now as excited about Microsoft Fabric and Azure OpenAI as I am!
developer: Brighton Kahrs