Article: Social Media Retrieval System Use Case #169

iusztinpaul · 2024-01-24T07:06:24Z

This PR contains an article on how to build a real-time social media retrieval system (used in RAG systems). The article uses data scraped from my LinkedIn profile, but the example can easily be extended to other types of written data.

The article shows how to build a real-time CDC system between a hypothetical data source (in this article, the data source is simulated with a couple of JSON files) and a Qdrant vector DB using Bytewax as the stream engine.

Within the ingestion pipeline, we show a custom example of how to clean, chunk, and embed LinkedIn posts.

We initially built a standard retrieval client, and afterward, we improved it using the rerank pattern. We visualized the query results on a 2D plot using UMAP.
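As a rough illustration of the rerank pattern mentioned above, here is a minimal sketch, not the article's implementation. It assumes a toy in-memory index instead of Qdrant, and a token-overlap scorer standing in for the more expensive model that would normally do the rescoring. All names (`retrieve`, `rerank`, `token_overlap`) are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             index: List[Tuple[str, Sequence[float]]],
             limit: int) -> List[str]:
    """First stage: cheap vector similarity over every stored post."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:limit]]


def rerank(query: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float],
           keep: int) -> List[str]:
    """Second stage: rescore only the shortlisted candidates with a costlier scorer."""
    return sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)[:keep]


def token_overlap(query: str, text: str) -> float:
    """Toy stand-in for a reranking model: Jaccard overlap of lowercase tokens."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


# Toy index of (post text, embedding); the 2-D vectors are made up for illustration.
index = [
    ("Streaming pipelines with Bytewax", [0.9, 0.1]),
    ("Qdrant vector DB tips", [0.8, 0.3]),
    ("My weekend hiking trip", [0.1, 0.9]),
]

candidates = retrieve([1.0, 0.2], index, limit=2)
top = rerank("Qdrant vector DB", candidates, token_overlap, keep=1)
```

The point of the two stages is that the first one is cheap enough to run over the whole collection, while the second one only ever sees the short candidate list.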

The article focuses solely on the retrieval part of a RAG system. Still, it can easily be extended by adding various LLMs and prompt engineering techniques on top of the retrieved LinkedIn posts.

iusztinpaul · 2024-02-05T14:06:02Z

docs/use_cases/social_media_retrieval.md

+
+Here are the results ↓
+
+:::::tabs


I am trying to show the images in different tabs using the archbee syntax. But it doesn't work directly on GitHub, so I want to be sure it will work on VectorHub.

morkapronczay

Sorry for taking so long to review. This is an extremely high-quality piece; I only made some minor suggestions. Thank you very much!


iusztinpaul · 2024-02-23T15:14:39Z

@morkapronczay No worries. I am excited to see you like it.

I updated the article with all your suggestions and fixed the conflicts.

Feel free to let me know if you need anything else or if we can merge it.

robertdhayanturner · 2024-02-23T16:11:22Z

hello @iusztinpaul ! Robert here, working on an edit of this PR.
per @morkapronczay 's comments, this is a GREAT article, so I will also only have a few questions for you, which I'll get to you today. Then once we're clear on those, you can check the whole article to make sure it's good, and we'll be done.

More in an hour or so. Thx!

iusztinpaul · 2024-02-23T17:21:14Z

@robertdhayanturner sure, I will watch the repo for any updates.

robertdhayanturner · 2024-02-23T19:22:44Z

@iusztinpaul
per earlier, great article! Just a few questions for clarification:
I can take care of the edits, just need your input.

line 52 - we make a point of saying that the retrieval client is decoupled from data ingestion pipeline.
If decoupling is important, we should say why...
(e.g. for scalability, fault isolation, dev and maintenance, flexibility?)

line 54 - is avoiding training-serving skew a separate, unrelated point from the decoupling above (line 52)?

line 56 - "we can find similar posts using another post or other questions or sentences."
Do you mean you can retrieve similar posts using a variety of query types - e.g., posts, questions, sentences...?

line 58 - "the rerank pattern"
"The" implies a specific rerank pattern. Which rerank pattern?

line 161 - the modularity itself allows us to 1 validate the data at each step, and 2 reuse the code in retrieval, correct? Just making sure it's the modularity that allows 1.

lines 161 and 165, these refer to two entirely separate processes, correct? (just making sure)

line 180 "this strategy" refers to what strategy specifically?

line 431 - check my revision here for accuracy

line 606 - Are you comparing query_qdrant_visualization_rerank.png to query_qdrant_visualization.png ?
I think maybe I misunderstood...

iusztinpaul · 2024-02-24T15:19:30Z

Here are the clarifications @robertdhayanturner

"line 52 - we make a point of saying that the retrieval client is decoupled from data ingestion pipeline."

It is important because the retrieval client and streaming pipeline have two different roles. They must be decoupled, as they will run in different environments. The streaming pipeline populates the vector DB and constantly listens for new incoming data (running somewhere in the cloud). The retrieval client queries the vector DB for similar results and will be used solely on the client side.

"line 54 - is avoiding training-serving skew an separate, unrelated point from the decoupling above (line 52)?"

It is related, as the preprocessing steps on the streaming pipeline and retrieval client are the same. Thus, if you don't preprocess the data the same way, you end up with scenarios similar to the training-serving skew.
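A minimal sketch of that idea, assuming both sides import one shared preprocessing function. The function names and the specific cleaning and chunking rules here are hypothetical and are not taken from the article's code.

```python
import re
from typing import List


def clean(text: str) -> str:
    """Drop URLs, collapse whitespace, and lowercase the text."""
    text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def chunk(text: str, max_words: int = 5) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def preprocess(text: str) -> List[str]:
    """Single source of truth for preprocessing, shared by both code paths."""
    return chunk(clean(text))


def ingest(post_text: str) -> List[str]:
    """Streaming pipeline side: what would be embedded and written to the vector DB."""
    return preprocess(post_text)


def prepare_query(query_text: str) -> List[str]:
    """Retrieval client side: what would be embedded and sent as the query."""
    return preprocess(query_text)
```

Because `ingest` and `prepare_query` both call `preprocess`, a change to the cleaning rules reaches the stored posts and the incoming queries together, which is what keeps the two sides from drifting apart.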

"line 58 - "the rerank pattern""

The rerank pattern is a pattern in itself. That is its name: rerank. I know it sounds a little strange, but I don't know what else to call it. You could rephrase it.

"line 161 - the modularity itself allows us to 1 validate the data at each step, and 2 reuse the code in retrieval, correct? Just making sure it's the modularity that allows 1."

Yes

"lines 161 and 165, these refer to two entirely separate processes, correct? (just making sure)"

Yes, at line 165, I started a new section.

"line 180 "this strategy" refers to what strategy specifically?"

The one where we wrap every state of the post into a pydantic model.
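To make that strategy concrete, here is a dependency-free sketch. It uses standard-library dataclasses as a stand-in for pydantic models, so it only mimics the idea (one typed object per state, each built from the previous one). The class and field names are hypothetical and do not come from the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RawPost:
    """State 1: the post as it arrives from the data source."""
    post_id: str
    text: str

    def __post_init__(self) -> None:
        # Hand-written check; with pydantic this would be a field validator.
        if not self.post_id:
            raise ValueError("post_id must not be empty")


@dataclass(frozen=True)
class CleanedPost:
    """State 2: the post after text cleaning."""
    post_id: str
    cleaned_text: str

    @classmethod
    def from_raw(cls, raw: RawPost) -> "CleanedPost":
        # Toy cleaning rule: collapse all runs of whitespace.
        return cls(post_id=raw.post_id, cleaned_text=" ".join(raw.text.split()))


@dataclass(frozen=True)
class ChunkedPost:
    """State 3: the cleaned post split into chunks ready for embedding."""
    post_id: str
    chunks: List[str]

    @classmethod
    def from_cleaned(cls, cleaned: CleanedPost, max_words: int = 4) -> "ChunkedPost":
        words = cleaned.cleaned_text.split()
        chunks = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
        return cls(post_id=cleaned.post_id, chunks=chunks)


raw = RawPost(post_id="p1", text="  RAG   needs good retrieval to work well ")
cleaned = CleanedPost.from_raw(raw)
chunked = ChunkedPost.from_cleaned(cleaned)
```

Each transition is a small constructor that can be validated and tested on its own, and the same classes can be imported by the retrieval client.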

"line 431 - check my revision here for accuracy"

I would rewrite the sentence "We can allay concerns about losing context or meaning, since we can query Qdrant with each chunk and merge the results." as "We can query Qdrant with each chunk and merge the results."

Other than that, it's ok.
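For readers of this thread, a sketch of what "query with each chunk and merge the results" could look like. The `search` callable is a hypothetical stand-in for a vector DB query, and keeping the best score per post is one possible merge rule assumed here, not necessarily the one used in the article.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (post_id, similarity score)


def merge_hits(per_chunk_hits: List[List[Hit]], limit: int) -> List[Hit]:
    """Merge hit lists from several chunk queries, keeping each post's best score."""
    best: Dict[str, float] = {}
    for hits in per_chunk_hits:
        for post_id, score in hits:
            if score > best.get(post_id, float("-inf")):
                best[post_id] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]


def search_by_chunks(chunks: List[str],
                     search: Callable[[str], List[Hit]],
                     limit: int) -> List[Hit]:
    """Run one search per query chunk, then merge the per-chunk results."""
    return merge_hits([search(c) for c in chunks], limit)


# Canned results standing in for real vector DB responses.
fake_results = {
    "chunk a": [("p1", 0.9), ("p2", 0.4)],
    "chunk b": [("p2", 0.7), ("p3", 0.5)],
}
merged = search_by_chunks(["chunk a", "chunk b"], lambda c: fake_results[c], limit=2)
```

Here post `p2` is returned by both chunk queries and keeps its higher score of 0.7 after the merge.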

"line 606 - Are you comparing query_qdrant_visualization_rerank.png to query_qdrant_visualization.png ?
I think maybe I misunderstood..."

Yes, exactly.

Hopefully, I made everything clear. Feel free to let me know if you have any other questions.

robertdhayanturner · 2024-02-25T12:59:56Z

@iusztinpaul Thanks, Paul.
You missed one:

line 56 - "we can find similar posts using another post or other questions or sentences."
Do you mean you can retrieve similar posts using a variety of query types - e.g., posts, questions, sentences...?

robertdhayanturner · 2024-02-25T13:21:58Z

@iusztinpaul
"line 606 - Are you comparing query_qdrant_visualization_rerank.png to query_qdrant_visualization.png ?
I think maybe I misunderstood..."

Yes, exactly.

So... when I look at these two mappings, it looks like the returned posts in query_qdrant_visualization_rerank.png are further from the query than in query_qdrant_visualization.png. Whereas you say: "While the returned posts aren't very close to the query, they are a lot closer to the query compared to when we weren't reranking the retrieved posts."

iusztinpaul · 2024-02-25T16:04:55Z

@robertdhayanturner

"line 56 - "we can find similar posts using another post or other questions or sentences."
Do you mean you can retrieve similar posts using a variety of query types - e.g., posts, questions, sentences...?"

Yes, exactly.

iusztinpaul · 2024-02-25T16:08:47Z

@robertdhayanturner

"line 606 - Are you comparing query_qdrant_visualization_rerank.png to query_qdrant_visualization.png ?
I think maybe I misunderstood..."

Yes, exactly.

So... when I look at these two mappings, it looks like the returned posts in query_qdrant_visualization_rerank.png are further from the query than in query_qdrant_visualization.png. Whereas you say: "While the returned posts aren't very close to the query, they are a lot closer to the query compared to when we weren't reranking the retrieved posts."

You are right. I switched the diagrams when renaming them. You can switch the names between them.

iusztinpaul added 18 commits January 24, 2024 09:04

docs: Initialize blog file

de4c493

docs: Add contribuitors

e97934e

docs: Social media retrieval article introduction

6675286

docs: Finish streaming pipeline section

3fdb383

docs: Add retrieval client draft

5c139f3

docs: Add exampleS

0f5f26a

docs: Add exampleS

8b2849a

docs: Add exampleS

915721b

docs: Add all the results

396a8ba

docs: Add TODOs

ab36519

docs: Fix image

5aafd64

docs: Add diagrams

b1c6879

fix: diagrams

0645375

fix: diagrams

249ce69

fix: Grammatic errors

89e0dd2

docs: Remove TODOs

dbf8318

docs: Final check improvements

dca6e13

docs: Final check improvements

7af4291

iusztinpaul changed the title ~~Social Media Retrieval System Use Case (WIP)~~ Social Media Retrieval System Use Case Feb 5, 2024

iusztinpaul marked this pull request as ready for review February 5, 2024 14:02

ClaireSuperlinked added the "stage: content review" label (PR under review of the high level content direction) Feb 5, 2024

iusztinpaul commented Feb 5, 2024

iusztinpaul changed the title ~~Social Media Retrieval System Use Case~~ Article: Social Media Retrieval System Use Case Feb 6, 2024

iusztinpaul added 4 commits February 7, 2024 14:59

docs: Add link to code

6354e95

docs: Add link to code

34f8849

docs: Add link to code

6e613ad

docs: Add link to code

fa89010

morkapronczay approved these changes Feb 22, 2024

morkapronczay added the "stage: style review" label (PR under review for style guide compliance, https://hub.superlinked.com/contributing) and removed the "stage: content review" label (PR under review of the high level content direction) Feb 22, 2024

iusztinpaul and others added 4 commits February 23, 2024 17:00

Update docs/use_cases/social_media_retrieval.md

0f40d12

Co-authored-by: Mór Kapronczay <[email protected]>

docs: Update article with suggestions

350ba71

Merge branch 'pauliusztin/linkedin-posts-retrieval' of github.com:ius…

95f1aa1

…ztinpaul/VectorHub into pauliusztin/linkedin-posts-retrieval

fix: Conflicts

f34e65d

robertdhayanturner added 2 commits February 23, 2024 14:13

Update social_media_retrieval.md

fc9bd9a

incorporating all recent changes by author

Update social_media_retrieval.md

0ff3c79

added LInkedin to "profile"

robertdhayanturner added 3 commits February 25, 2024 15:42

Update social_media_retrieval.md

fed4061

multiple edits on article line 514 and 604. images reversed by author on previous commit; query_qdrant_visualization query_qdrant_visualization_rerank switched these, but to avoid confusion will upload new images titled to reflect data represented

Update social_media_retrieval.md

b7eb5f8

small edit

Merge branch 'main' into pauliusztin/linkedin-posts-retrieval

3b481fe

robertdhayanturner merged commit 4d36d63 into superlinked:main Feb 25, 2024
1 check passed
