Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize remote db #41

Merged
merged 14 commits into from
May 17, 2024
Merged

Optimize remote db #41

merged 14 commits into from
May 17, 2024

Conversation

granawkins
Copy link
Member

The basic approach is:

  • Store document in the graph, not just the database. This means the .json files saved will be quite a bit bigger, but never more than 2x the size of the codebase, which isn't crazy, each time it updates it will overwrite the previous version instead of saving a new one.
  • Do all database operations as batches

Seems to be all working alright on smaller repos, and I'm getting the expected speedup when working with the remote database.

Copy link

@mentatai mentatai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the pull request aims to reduce direct database operations within annotators and store more data in memory, which aligns with the approach of writing and reading less often from the database to improve performance. The proposed changes appear to align with these goals by modifying method signatures and internal logic to eliminate unnecessary database interactions. It's important though to ensure all related changes are fully implemented across the entire codebase to maintain consistency and functionality.

Butler is in closed beta. Reply with feedback or to ask Butler to review other parts of the PR. Please give feedback with emoji reacts.

# Save to db and graph
metadatas = record["metadatas"][0]
metadatas[self.call_field_id] = json.dumps(calls)
db.update(data["checksum"], metadatas=metadatas)
data[self.call_field_id] = calls
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing removal of db parameter as suggested in docstring change at line 155. This should be consistent across the class if database interactions are removed as per the changes in other parts of the code.

Suggested change
data[self.call_field_id] = calls
async def annotate(
self, graph: KnowledgeGraph, refresh: bool = False,
) -> KnowledgeGraph:

@@ -212,7 +211,6 @@ def is_complete(self, graph: KnowledgeGraph, db: Database) -> bool:
document, context = get_document_and_context(
node,
graph,
db,
summary_field_id=self.summary_field_id,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The db parameter is still mentioned in the annotate method whereas it appears to be unused considering recent changes that focus on reducing direct database operations within annotators.

Suggested change
summary_field_id=self.summary_field_id,
async def annotate(
self, graph: KnowledgeGraph, refresh: bool = False
) -> KnowledgeGraph:

@granawkins granawkins merged commit 781e8fb into main May 17, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant