Article: Social Media Retrieval System Use Case #169

@iusztinpaul
Contributor

iusztinpaul commented Jan 24, 2024

PR that contains an article on how to build a real-time social media retrieval system (used in RAG systems). The article uses LinkedIn data scraped from my LinkedIn profile, but the example can easily be extended to other types of written data.

The article shows how to build a real-time CDC system between a hypothetical data source (in this article, the data source is simulated with a couple of JSON files) and a Qdrant vector DB using Bytewax as the stream engine.

Within the ingestion pipeline, we show a custom example of how to clean, chunk, and embed LinkedIn posts.

We initially built a standard retrieval client, and afterward, we improved it using the rerank pattern. We visualized the query results on a 2D plot using UMAP.

The article focuses solely on the retrieval part of an RAG system. Still, it can easily be extended by adding various LLMs and prompt engineering techniques on top of the retrieved LinkedIn posts.

@iusztinpaul iusztinpaul changed the title Social Media Retrieval System Use Case (WIP) Social Media Retrieval System Use Case Feb 5, 2024
@iusztinpaul iusztinpaul marked this pull request as ready for review February 5, 2024 14:02
@ClaireSuperlinked ClaireSuperlinked added the stage: content review PR under review of the high level content direction label Feb 5, 2024

Here are the results ↓

:::::tabs
Contributor Author

I am trying to show the images in different tabs using the Archbee syntax. It doesn't render directly on GitHub, so I want to be sure it will work on VectorHub.

@iusztinpaul iusztinpaul changed the title Social Media Retrieval System Use Case Article: Social Media Retrieval System Use Case Feb 6, 2024
Contributor

@morkapronczay morkapronczay left a comment

Sorry for taking so long to review. This is an extremely high-quality piece; I only made some minor suggestions. Thank you very much!

Review threads on docs/use_cases/social_media_retrieval.md: 3 (all resolved, 1 outdated)
@morkapronczay morkapronczay added stage: style review PR under review for style guide compliance ( https://hub.superlinked.com/contributing ) and removed stage: content review PR under review of the high level content direction labels Feb 22, 2024
@iusztinpaul
Contributor Author

iusztinpaul commented Feb 23, 2024

@morkapronczay No worries. I am excited to see you like it.

I updated the article with all your suggestions and fixed the conflicts.

Feel free to let me know if you need anything else or if we can merge it.

@robertdhayanturner
Collaborator

robertdhayanturner commented Feb 23, 2024

Hello @iusztinpaul! Robert here, working on an edit of this PR.
Per @morkapronczay's comments, this is a GREAT article, so I also have only a few questions for you, which I'll get to you today. Once we're clear on those, you can check the whole article to make sure it's good, and we'll be done.

More in an hour or so. Thx!

@iusztinpaul
Contributor Author

@robertdhayanturner sure, I will watch the repo for any updates.

incorporating all recent changes by author
added LInkedin to "profile"
@robertdhayanturner
Collaborator

robertdhayanturner commented Feb 23, 2024

@iusztinpaul
per earlier, great article! Just a few questions for clarification:
I can take care of the edits, just need your input.

line 52 - we make a point of saying that the retrieval client is decoupled from the data ingestion pipeline.
If decoupling is important, we should say why...
(e.g. for scalability, fault isolation, dev and maintenance, flexibility?)

line 54 - is avoiding training-serving skew a separate, unrelated point from the decoupling above (line 52)?

line 56 - "we can find similar posts using another post or other questions or sentences."
Do you mean you can retrieve similar posts using a variety of query types - e.g., posts, questions, sentences...?

line 58 - "the rerank pattern"
"The" implies a specific rerank pattern. Which rerank pattern?

line 161 - the modularity itself allows us to (1) validate the data at each step, and (2) reuse the code in retrieval, correct? Just making sure it's the modularity that allows (1).

lines 161 and 165, these refer to two entirely separate processes, correct? (just making sure)

line 180 "this strategy" refers to what strategy specifically?

line 431 - check my revision here for accuracy

line 606 - Are you comparing query_qdrant_visualization_rerank.png to query_qdrant_visualization.png ?
I think maybe I misunderstood...

@iusztinpaul
Contributor Author

iusztinpaul commented Feb 24, 2024

Here are the clarifications @robertdhayanturner

"line 52 - we make a point of saying that the retrieval client is decoupled from the data ingestion pipeline."

It is important because the retrieval client and the streaming pipeline have two different roles. They must be decoupled because they run in different environments. The streaming pipeline populates the vector DB and constantly listens for new incoming data (running somewhere in the cloud). The retrieval client queries the vector DB for similar results and is used solely on the client side.

"line 54 - is avoiding training-serving skew a separate, unrelated point from the decoupling above (line 52)?"

It is related, as the preprocessing steps in the streaming pipeline and the retrieval client are the same. If you don't preprocess the data the same way on both sides, you end up with scenarios similar to training-serving skew.
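
To make the skew point concrete, here is a minimal sketch in which ingestion and retrieval share one cleaning function. The names `preprocess`, `ingest`, and `search`, the cleaning rules, and the substring match are all hypothetical stand-ins, not the article's code; a real setup would embed the cleaned text and query a vector DB.

```python
import re
import unicodedata


def preprocess(text: str) -> str:
    """Hypothetical cleaning step shared by ingestion and retrieval."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"https?://\S+", "[URL]", text)  # mask links
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip().lower()


def ingest(post: str, index: list[str]) -> None:
    """Streaming-pipeline side: clean the post before storing it."""
    index.append(preprocess(post))


def search(query: str, index: list[str]) -> list[str]:
    """Client side: clean the query the same way before matching."""
    cleaned = preprocess(query)
    return [doc for doc in index if cleaned in doc]


index: list[str] = []
ingest("Check out   my NEW post: https://example.com/abc", index)
hits = search("new post:  https://example.com/xyz", index)
```

Because both sides call the same `preprocess`, the query and the stored post agree on casing, spacing, and link masking. If the client skipped that step, the same query would no longer line up with the stored text.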

"line 58 - "the rerank pattern""

The rerank pattern is a pattern in itself. That is its name: "rerank". I know it sounds a little strange, but I don't know what else to call it. You could rephrase it.
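
For readers unfamiliar with the term, a rough sketch of the general retrieve-then-rerank idea follows. Both scoring functions are toy stand-ins chosen to keep the example dependency-free (word overlap in place of vector similarity, Jaccard overlap in place of a costlier relevance model), and the names `cheap_score`, `careful_score`, and `retrieve_and_rerank` are assumptions, not the article's code.

```python
def tokens(text: str) -> set[str]:
    return set(text.lower().split())


def cheap_score(query: str, post: str) -> int:
    """First stage: count shared words (stand-in for vector similarity)."""
    return len(tokens(query) & tokens(post))


def careful_score(query: str, post: str) -> float:
    """Second stage: Jaccard overlap (stand-in for a costlier relevance model)."""
    q, p = tokens(query), tokens(post)
    union = q | p
    return len(q & p) / len(union) if union else 0.0


def retrieve_and_rerank(query: str, posts: list[str], fetch_k: int = 3, top_k: int = 2) -> list[str]:
    # Stage 1: over-fetch candidates using the cheap score.
    candidates = sorted(posts, key=lambda p: cheap_score(query, p), reverse=True)[:fetch_k]
    # Stage 2: re-order only those candidates using the careful score.
    return sorted(candidates, key=lambda p: careful_score(query, p), reverse=True)[:top_k]


posts = [
    "vector databases store embeddings for search and many other unrelated things too",
    "vector databases explained",
    "my weekend hiking trip",
    "streaming pipelines with python",
]
results = retrieve_and_rerank("vector databases", posts)
```

Here the first two posts tie in the first stage, and the second stage moves the shorter, more focused post ahead of the longer one.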

"line 161 - the modularity itself allows us to (1) validate the data at each step, and (2) reuse the code in retrieval, correct? Just making sure it's the modularity that allows (1)."

Yes

"lines 161 and 165, these refer to two entirely separate processes, correct? (just making sure)"

Yes, at line 165, I started a new section.

"line 180 "this strategy" refers to what strategy specifically?"

The one where we wrap every state of the post into a Pydantic model.
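
As an illustration of that strategy, the sketch below gives each state of a post its own model with a constructor from the previous state. To stay dependency-free it swaps in standard-library dataclasses for Pydantic, so validation is hand-written in `__post_init__`. The class names, field names, and the fixed-size word chunking are hypothetical, not taken from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawPost:
    post_id: str
    text: str

    def __post_init__(self) -> None:
        # Validate at the boundary so later states can trust the data.
        if not self.text.strip():
            raise ValueError("raw post text must not be empty")


@dataclass(frozen=True)
class CleanedPost:
    post_id: str
    cleaned_text: str

    @classmethod
    def from_raw(cls, raw: RawPost) -> "CleanedPost":
        # Toy cleaning rule: collapse whitespace.
        return cls(post_id=raw.post_id, cleaned_text=" ".join(raw.text.split()))


@dataclass(frozen=True)
class ChunkedPost:
    post_id: str
    chunk_id: int
    chunk_text: str

    @classmethod
    def from_cleaned(cls, post: CleanedPost, max_words: int = 5) -> list["ChunkedPost"]:
        # Toy chunking rule: fixed-size groups of words.
        words = post.cleaned_text.split()
        return [
            cls(post.post_id, i // max_words, " ".join(words[i : i + max_words]))
            for i in range(0, len(words), max_words)
        ]


raw = RawPost(post_id="p1", text="  Streaming   pipelines keep the vector DB in sync  ")
cleaned = CleanedPost.from_raw(raw)
chunks = ChunkedPost.from_cleaned(cleaned)
```

Each transition is a small, testable step, and the same classes can be reused on the retrieval side to process a query the way posts are processed at ingestion.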

"line 431 - check my revision here for accuracy"

I would rewrite this sentence: "We can allay concerns about losing context or meaning, since we can query Qdrant with each chunk and merge the results." into this: "We can query Qdrant with each chunk and merge the results."

Other than that, it's ok.
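
A small sketch of what "query with each chunk and merge the results" could look like. The `search_chunk` function is a hypothetical word-overlap stand-in for a single vector DB query, and keeping each post's best score is only one possible merge rule, assumed here for illustration.

```python
def search_chunk(chunk: str, index: dict[str, str], limit: int = 2) -> list[tuple[str, float]]:
    """Hypothetical stand-in for one vector DB query: score by word overlap."""
    words = set(chunk.lower().split())
    if not words:
        return []
    scored = [
        (post_id, len(words & set(text.lower().split())) / len(words))
        for post_id, text in index.items()
    ]
    scored = [(post_id, score) for post_id, score in scored if score > 0]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)[:limit]


def search_all_chunks(chunks: list[str], index: dict[str, str], limit: int = 3) -> list[tuple[str, float]]:
    best: dict[str, float] = {}
    for chunk in chunks:
        for post_id, score in search_chunk(chunk, index):
            # A post may match several chunks: keep it once, with its best score.
            best[post_id] = max(score, best.get(post_id, 0.0))
    return sorted(best.items(), key=lambda hit: hit[1], reverse=True)[:limit]


index = {
    "a": "qdrant vector search",
    "b": "bytewax streaming pipeline",
    "c": "umap plots",
}
merged = search_all_chunks(["vector search", "streaming pipeline search"], index)
```

Post "a" is returned by both chunk queries but appears only once in `merged`, ranked by its higher score.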

"line 606 - Are you comparing query_qdrant_visualization_rerank.png to query_qdrant_visualization.png ?
I think maybe I misunderstood..."

Yes, exactly.


Hopefully, I made everything clear. Feel free to let me know if you have any other questions.

@robertdhayanturner
Collaborator

@iusztinpaul Thanks, Paul.
You missed one:

line 56 - "we can find similar posts using another post or other questions or sentences."
Do you mean you can retrieve similar posts using a variety of query types - e.g., posts, questions, sentences...?

@robertdhayanturner
Collaborator

@iusztinpaul
"line 606 - Are you comparing query_qdrant_visualization_rerank.png to query_qdrant_visualization.png ?
I think maybe I misunderstood..."

Yes, exactly.

So... when I look at these two mappings, it looks like the returned posts in query_qdrant_visualization_rerank.png are further from the query than in query_qdrant_visualization.png. Whereas you say: "While the returned posts aren't very close to the query, they are a lot closer to the query compared to when we weren't reranking the retrieved posts."

@iusztinpaul
Contributor Author

@robertdhayanturner

"line 56 - "we can find similar posts using another post or other questions or sentences."
Do you mean you can retrieve similar posts using a variety of query types - e.g., posts, questions, sentences...?"

Yes, exactly.

@iusztinpaul
Contributor Author

@robertdhayanturner

"line 606 - Are you comparing query_qdrant_visualization_rerank.png to query_qdrant_visualization.png ?
I think maybe I misunderstood..."

Yes, exactly.

So... when I look at these two mappings, it looks like the returned posts in query_qdrant_visualization_rerank.png are further from the query than in query_qdrant_visualization.png. Whereas you say: "While the returned posts aren't very close to the query, they are a lot closer to the query compared to when we weren't reranking the retrieved posts."

You are right. I switched the diagrams when renaming them. You can switch the names between them.

multiple edits on article

line 514 and 604. images reversed by author on previous commit;
query_qdrant_visualization
query_qdrant_visualization_rerank

switched these, but to avoid confusion will upload new images titled to reflect data represented
@robertdhayanturner robertdhayanturner merged commit 4d36d63 into superlinked:main Feb 25, 2024
1 check passed
Labels
stage: style review PR under review for style guide compliance ( https://hub.superlinked.com/contributing )
4 participants