Article: Social Media Retrieval System Use Case #169
Here are the results ↓
:::::tabs
I am trying to show the images in different tabs using the Archbee syntax. But it doesn't work directly on GitHub, so I want to be sure it will work on VectorHub.
Sorry for taking so long to review. This is an extremely high-quality piece; I only made some minor suggestions. Thank you very much!
Co-authored-by: Mór Kapronczay <[email protected]>
…ztinpaul/VectorHub into pauliusztin/linkedin-posts-retrieval
@morkapronczay No worries. I am excited to see you like it. I updated the article with all your suggestions and fixed the conflicts. Feel free to let me know if you need anything else or if we can merge it.
hello @iusztinpaul ! Robert here, working on an edit of this PR. More in an hour or so. Thx!
@robertdhayanturner sure, I will watch the repo for any updates.
incorporating all recent changes by author
added LinkedIn to "profile"
@iusztinpaul
line 52 - we make a point of saying that the retrieval client is decoupled from the data ingestion pipeline.
line 54 - is avoiding training-serving skew a separate, unrelated point from the decoupling above (line 52)?
line 56 - "we can find similar posts using another post or other questions or sentences."
line 58 - "the rerank pattern"
line 161 - the modularity itself allows us to (1) validate the data at each step, and (2) reuse the code in retrieval, correct? Just making sure it's the modularity that allows (1).
lines 161 and 165 - these refer to two entirely separate processes, correct? (just making sure)
line 180 - "this strategy" refers to what strategy specifically?
line 431 - check my revision here for accuracy
line 606 - Are you comparing query_qdrant_visualization_rerank.png to query_qdrant_visualization.png?
Here are the clarifications @robertdhayanturner

"line 52 - we make a point of saying that the retrieval client is decoupled from data ingestion pipeline."
It is important because the retrieval client and streaming pipeline have two different roles. They must be decoupled, as they will run in different environments. The streaming pipeline populates the vector DB and constantly listens for new incoming data (running somewhere on the cloud). The retrieval client queries the vector DB for similar results and will be used solely on the client side.

"line 54 - is avoiding training-serving skew an separate, unrelated point from the decoupling above (line 52)?"
It is related, as the preprocessing steps on the streaming pipeline and retrieval client are the same. Thus, if you don't preprocess the data the same way, you end up with scenarios similar to the training-serving skew.

"line 58 - "the rerank pattern""
The rerank pattern is a pattern in itself. This is its name: rerank. I know it sounds a little strange, but I don't know what else to call it. You could rephrase it.

"line 161 - the modularity itself allows us to 1 validate the data at each step, and 2 reuse the code in retrieval, correct? Just making sure it's the modularity that allows 1."
Yes.

"lines 161 and 165, these refer to two entirely separate processes, correct? (just making sure)"
Yes, at line 165, I started a new section.

"line 180 "this strategy" refers to what strategy specifically?"
The one where we wrap every state of the post into a

"line 431 - check my revision here for accuracy"
I would rewrite this sentence: "We can allay concerns about losing context or meaning, since we can query Qdrant with each chunk and merge the results." into this: "We can query Qdrant with each chunk and merge the results." Other than that, it's ok.

"line 606 - Are you comparing query_qdrant_visualization_rerank.png to query_qdrant_visualization.png ?"
Yes, exactly.

Hopefully, I made everything clear. Feel free to let me know if you have any other questions.
@iusztinpaul Thanks, Paul. line 56 - "we can find similar posts using another post or other questions or sentences."
@iusztinpaul Yes, exactly. So... when I look at these two mappings, it looks like the returned posts in query_qdrant_visualization_rerank.png are further from the query than in query_qdrant_visualization.png. Whereas you say: "While the returned posts aren't very close to the query, they are a lot closer to the query compared to when we weren't reranking the retrieved posts."
"line 56 - "we can find similar posts using another post or other questions or sentences." Yes, exactly.
You are right. I switched the diagrams when renaming them. You can switch the names between them.
multiple edits on article lines 514 and 604. images query_qdrant_visualization and query_qdrant_visualization_rerank were reversed by the author in a previous commit; switched these, but to avoid confusion will upload new images titled to reflect the data represented
small edit
This PR contains an article on how to build a real-time social media retrieval system (used in RAG systems). The article uses LinkedIn data scraped from my LinkedIn profile, but the example can easily be extended to other types of written data.
The article shows how to build a real-time CDC system between a hypothetical data source (in this article, the data source is simulated with a couple of JSON files) and a Qdrant vector DB using Bytewax as the stream engine.
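To make the CDC idea concrete, here is a minimal, standard-library-only sketch. It is not the Bytewax dataflow or the Qdrant client from the article: the event format (`op`, `post_id`, `text`), the helper names `change_events` and `apply_event`, and the plain `dict` standing in for the vector DB are all assumptions made for illustration.

```python
import json
from typing import Dict, Iterable, Iterator


def change_events(raw_lines: Iterable[str]) -> Iterator[dict]:
    """Yield one change event per non-blank JSON line (hypothetical event format)."""
    for line in raw_lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def apply_event(index: Dict[str, str], event: dict) -> None:
    """Mirror a single source change into the in-memory stand-in for the vector DB."""
    if event["op"] == "delete":
        index.pop(event["post_id"], None)
    else:  # treat anything else as an upsert
        index[event["post_id"]] = event["text"]


# Simulated data source: each string plays the role of one change record.
source = [
    '{"op": "upsert", "post_id": "1", "text": "first draft"}',
    '{"op": "upsert", "post_id": "2", "text": "another post"}',
    '{"op": "upsert", "post_id": "1", "text": "edited post"}',
    '{"op": "delete", "post_id": "2"}',
]

index: Dict[str, str] = {}
for event in change_events(source):
    apply_event(index, event)

print(index)
```

After replaying the events, the stand-in index holds only the latest version of post `1`, which is the property a CDC pipeline is meant to preserve between the source and the vector DB.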
Within the ingestion pipeline, we show a custom example of how to clean, chunk, and embed LinkedIn posts.
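As a rough illustration of the clean, chunk, and embed steps, the sketch below uses only the Python standard library. The cleaning rules, the word-window chunking, and the hashed bag-of-words `embed` function are assumptions chosen to keep the example self-contained; they are not the article's actual preprocessing or embedding model.

```python
import hashlib
import re
from typing import List


def clean(text: str) -> str:
    """Drop URLs and non-ASCII characters (e.g. emojis), then collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk(text: str, max_words: int = 5, overlap: int = 1) -> List[str]:
    """Split text into word windows of max_words that share `overlap` words."""
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + max_words]) for i in range(0, stop, step)]


def embed(text: str, dim: int = 8) -> List[float]:
    """Toy embedding: hashed bag-of-words counts (stand-in for a real model)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


post = "Check   out my new course \U0001F680 https://example.com on building RAG systems!"
cleaned = clean(post)
chunks = chunk(cleaned)
vectors = [embed(c) for c in chunks]

print(cleaned)
print(chunks)
```

Keeping the three steps as separate functions means the retrieval client can call the same `clean`, `chunk`, and `embed` on a query that the ingestion side calls on a post, which is what avoids the training-serving skew discussed in the review thread.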
We initially built a standard retrieval client, and afterward, we improved it using the rerank pattern. We visualized the query results on a 2D plot using UMAP. The article focuses solely on the retrieval part of a RAG system. Still, it can easily be extended by adding various LLMs and prompt engineering techniques on top of the retrieved LinkedIn posts.
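The rerank pattern can be sketched in two stages: a cheap vector search that returns candidates, followed by a second scoring pass over only those candidates. The example below is a simplified, standard-library-only illustration; the in-memory `index`, the hand-written 2D vectors, and the `word_overlap` scorer are hypothetical stand-ins for the vector DB and the reranking model.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             index: List[Tuple[str, List[float]]], k: int) -> List[str]:
    """Stage 1: return the k posts whose vectors are closest to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def rerank(query: str, candidates: List[str],
           score: Callable[[str, str], float], keep: int) -> List[str]:
    """Stage 2: reorder the candidates with a second scorer and keep the best ones."""
    return sorted(candidates, key=lambda text: score(query, text), reverse=True)[:keep]


def word_overlap(query: str, text: str) -> float:
    """Hypothetical scorer: Jaccard overlap of words (stand-in for a real reranker)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


index = [
    ("Qdrant is a vector database", [1.0, 0.0]),
    ("How to rerank retrieved posts", [0.9, 0.1]),
    ("My holiday photos", [0.0, 1.0]),
]
query, query_vec = "rerank retrieved posts", [1.0, 0.05]

candidates = retrieve(query_vec, index, k=2)
top = rerank(query, candidates, word_overlap, keep=1)

print(candidates)
print(top)
```

In this toy data the vector search alone ranks the Qdrant post first, while the second pass promotes the post that actually matches the query text, which is the effect the pattern is meant to achieve.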