onboarding bot #17
Our Notion tends to fall out of date. The source of truth is on GitHub. It would be interesting to have a plugin that checks what seems inaccurate or out of date on Notion, but the architecture seems a bit unclear. Until that's cleared up, I wouldn't want to prioritize making the Notion embeddings plugin yet. Same with READMEs. I think the better approach is to make a chatbot that you can Q&A with. Secondly, it could update the READMEs when it thinks something is inaccurate.
The simple solution is to move the Notion docs into a repo for easier syncing. Creating Notion pages via the API is possible but involves a learning curve, unless GPT-4 can handle it, which it probably can. notion-github-sync example. Alternatively, we could parse text from Notion using the API and convert it into Markdown docs. Respond to Notion DB page change example: We could set something like this up to listen for Notion DB changes and re-run embeddings on the updated content. I understand that GitHub is the source of truth for plugins, active teams, project overviews, comments, etc. However, as far as I know, we don't have any onboarding docs on GitHub.
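Parsing Notion content into Markdown could look something like the sketch below. The block shape is a simplified stand-in for what the Notion API's `blocks.children.list` endpoint returns (real blocks carry rich-text arrays, not plain strings), so treat the types here as illustrative assumptions:

```typescript
// Minimal sketch: convert a few common Notion block types to Markdown.
// NotionBlock is a simplified, hypothetical shape; the real API returns
// rich-text arrays per block that would need flattening first.
type NotionBlock =
  | { type: "heading_1"; text: string }
  | { type: "heading_2"; text: string }
  | { type: "bulleted_list_item"; text: string }
  | { type: "paragraph"; text: string };

function blocksToMarkdown(blocks: NotionBlock[]): string {
  return blocks
    .map((b) => {
      switch (b.type) {
        case "heading_1":
          return `# ${b.text}`;
        case "heading_2":
          return `## ${b.text}`;
        case "bulleted_list_item":
          return `- ${b.text}`;
        default:
          return b.text; // paragraphs pass through as-is
      }
    })
    .join("\n\n");
}
```

The resulting Markdown files could then be committed to a repo, giving us the same versioning and syncing story as the READMEs.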
What info would you feed the chatbot if not from Notion or READMEs? To handle Q&A on high-level org info like UbiquityOS, DevPool, Cards, and DeFi, you'd need more than task specs or comments.
I wasn't referring to updating READMEs. They contain project intent, setup instructions, and references to our architecture, which are great for org-wide context (e.g., what plugins we have, how to install them). However, they don't cover topics like DevPool, onboarding, recruitment, or investors, which is what the Notion docs handle. We need solid text chunks that explain things.

Individual comments and task specs aren't enough for an org-aware chatbot. Each embedding references its own text source; they aren't merged into a single context. So a user query becomes an embedding and gets compared against issue comments and task specs. For example, "Help me set up the kernel" would likely return task conversations with little value. Similarly, "What is UbiquityOS?" would return technical details instead of a comprehensive overview.

Each vector has a size. Larger vectors store more info but are more computationally expensive. We're currently using 1,024 dimensions for all embeddings. That's fine for small comments, but for entire conversations or codebases, you might want 3,072 dimensions to capture more context.

To build a good Q&A chatbot, we need embeddings from full documents, which is how traditional AI chatbots are made. Notion docs and READMEs are already written and could easily power a V1 Q&A chatbot. Eventually, the best approach is to use the embeddings to train or fine-tune our own model so that it has this knowledge built in, rather than fetching it in real time. The DAO info would be ideal for this, letting the model start with foundational knowledge and use embeddings for context.

Question: Is the vision to have a single chatbot entry (e.g.,
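To make the retrieval point concrete: each stored embedding is scored against the query independently and the top matches are returned, which is why a query like "What is UbiquityOS?" surfaces whichever individual chunks happen to be closest, never a merged overview. A toy sketch (3-dimensional stand-ins for the 1,024-dim vectors in use):

```typescript
// Sketch of the retrieval step: a query embedding is compared against each
// stored embedding one at a time via cosine similarity; the corpus is never
// merged into a single context. Vectors here are toy 3-dim examples.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

function topK(
  query: number[],
  docs: { id: string; vec: number[] }[],
  k: number
): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(query, d.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

If the closest chunks are all task-spec fragments, that's what the user gets back, regardless of how good a full-document answer would have been.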
@sshivaditya2019 request for comment regarding chatbot creation, knowledge base, etc. Do you agree/disagree with what I've said? You seem to be more knowledgeable than I am with these things, and it's been a while since I built a chatbot, so I may be a little out of touch.
@Keyrxng I think it would be beneficial to have a dedicated text corpus from Notion for creating embeddings and conducting similarity searches. You're correct that an organization-wide intelligent chatbot needs multiple text corpora, ideally focused on the DAO or the provider; this is essential to prevent model hallucination, since ordinary issue specs and tasks are simply insufficient. If Notion poses challenges, we could consider using GitHub Pages to maintain Markdown or HTML files for resources. This would offer versioning and serve as a more reliable source of truth for the chatbot.
Another reason why I have always been against direct messages: we can now pass historical data into LLMs. We have 90+% of all recent and relevant ideas/conversations/plans of Ubiquity accessible across GitHub comments and Telegram org chat messages.
Merge everything remotely relevant into a single context.
@0x4007 How do you envision this Q&A chatbot being used, and by whom? I've assumed on all platforms and by every demographic we have. Could you maybe show a couple of example Q&A scenarios?
A prime example of that here: stuffing 30k tokens of input into it, it starts to lose itself. Guardrails at that context depth seem to go out of the window a bit unless you step through sections, but it's difficult to get the output you want every single time. That's an interesting way to visualize things, and the blog is a good read too.
That sets a tone that is not in line with the long-term goals of the DAO re: database dependency, costs, etc. Although I've only been charged $0.10 for about 18 days' worth of embeddings; not that many in total, nor very long either.
This is an acceptable loss depending on the context in which the chatbot is being used. If it's DAO and services, then no, it should never hallucinate; if it's onboarding like setup and install, of course we have to allow some freedom, as it doesn't have codebase knowledge. If we had embeddings of entire codebases, then hallucinations would be expected to be at an absolute minimum. I read this and this recently when considering codebase embeddings. That's the real task, as we need to chunk efficiently (e.g., with a code parser), handle overlaps to "extend" context between embeddings, etc. I used LangChain before, which had a lot of helpful tools for these things, but we avoid it as an org. Have you done anything at that sort of scale before? I haven't, to be honest. Comments from a mod and a "leader" on the OpenAI forum:
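A minimal sketch of the overlap idea, assuming simple character-based windows (a real implementation would split on syntax boundaries via a code parser rather than fixed offsets):

```typescript
// Sketch of chunking with overlap so context "extends" between embeddings:
// each chunk repeats the tail of the previous one, so a statement cut at a
// boundary still appears whole in at least one chunk.
function chunkWithOverlap(text: string, size: number, overlap: number): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than chunk size");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

The size/overlap numbers are tuning knobs: bigger overlap costs more embeddings but loses less context at the seams.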
I had previously considered using something like ctags or the GNU Global source code tagging system (what IDE language servers etc. use) a while back, as I think it would go a long way toward producing data-rich codebase embeddings since we have no JSDoc-style docs, but I haven't researched it practically.
@UbiquityOS how do I set up and start this project? Sure! Our projects are based on ts-template, which relies on yarn 1.21. Be sure to:
Anyway, with a bit of prompting I'm quite certain this will work well enough. I've already done experiments in the past with more primitive models, and no embeddings, that worked fine.
Technically there will be two:
This could be split into two separate tasks or combined as one.
Number two is easy: we run on `push` events, identify any added or changed `.md` files, and we're done. Notion doc scanning isn't something we can listen for via webhook, I don't think. So maybe we could have a cron job run once every 30-60 days and parse the Notion docs? I'm sure we can grab the pages from the API with a valid API key. Before we automate Notion we need to decide:
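The `push`-event half could be as simple as the sketch below. The payload shape is a trimmed stand-in for the real GitHub push event body (which lists `added`/`modified`/`removed` paths per commit):

```typescript
// Sketch: on a GitHub `push` webhook, collect added or modified Markdown
// files across all commits in the payload, so only those get re-embedded.
// PushCommit is a trimmed, hypothetical slice of the real event payload.
interface PushCommit {
  added: string[];
  modified: string[];
  removed: string[];
}

function changedMarkdownFiles(commits: PushCommit[]): string[] {
  const files = new Set<string>(); // dedupe files touched in multiple commits
  for (const commit of commits) {
    for (const path of [...commit.added, ...commit.modified]) {
      if (path.endsWith(".md")) files.add(path);
    }
  }
  return [...files];
}
```

Removed files would need the opposite treatment (deleting their embeddings), which the sketch leaves out.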
`dao_info`, but inside `metadata` we can have something like `subgroup: recruitment | articles`, etc. The more useful metadata we can apply like that, the better, imo. Why? Because if your broad search results are poor, you can refine further and have a bit more control over the black box.
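The metadata-refinement idea can be sketched as a pre-filter before ranking. The `type` and `subgroup` field names here are illustrative assumptions, not a fixed schema:

```typescript
// Sketch: narrow a broad similarity search by filtering candidate rows on
// a hypothetical metadata.subgroup field before any ranking happens, giving
// a bit more control over the black box.
interface EmbeddingRow {
  id: string;
  type: string; // e.g. "dao_info" (illustrative)
  metadata: { subgroup?: string };
}

function filterBySubgroup(rows: EmbeddingRow[], subgroup: string): EmbeddingRow[] {
  return rows.filter((row) => row.metadata.subgroup === subgroup);
}
```

A query about hiring could then be scored only against `subgroup: recruitment` rows instead of the whole `dao_info` corpus.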