Refactor auto-archiver to use a modular structure for feeders/extractors/enrichers etc. #185

Open
wants to merge 57 commits into
base: main


pjrobertson
Collaborator

Also:

  • View the full list of options and config settings by running auto-archiver --help
  • Quicker startup of auto-archiver, as only the enabled modules are installed
  • New or additional modules can be enabled or disabled on the fly using command-line arguments such as --extractor=new_temporary_extractor
  • You can now run auto-archiver for the first time without creating an orchestration.yaml file first; auto-archiver will set up a basic one so you can get started in no time 🚀
  • You can write your latest command-line args back to your config file (so you don't have to keep passing them) by setting the -s or --store flag on the command line
  • All config settings are checked/validated at startup, along with module dependencies, to make sure you have set up modules correctly
  • Improved errors/warnings if there are any config/dependency issues, and more helpful guidance when getting started (plus: colour output)
  • You can now log directly to files using the logging: file option. Set the logging level as well using logging: level
  • Fixed up unit tests + added a few new ones
  • Set your own custom module folder with module_paths=/my/own/modules/ to easily extend auto-archiver with new modules. Simply create a new module, place it in that folder, then pass the folder path on the command line or save it in your orchestration.yaml
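
To illustrate the startup validation mentioned in the list above, here is a minimal sketch of checking one module's dependencies and required settings. The manifest layout (`python_dependencies`, `required`) and the function name `check_module` are assumptions made for this example and are not taken from the PR.

```python
import importlib.util
from typing import Any


def check_module(name: str, manifest: dict[str, Any], config: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems for one module (empty if OK).

    `manifest` is a hypothetical description of the module: the Python
    packages it needs and the config options that must be set.
    """
    problems: list[str] = []

    # Dependency check: can each top-level package be found on this machine?
    for package in manifest.get("python_dependencies", []):
        if importlib.util.find_spec(package) is None:
            problems.append(f"{name}: missing Python package '{package}'")

    # Config check: is every required option present for this module?
    module_config = config.get(name, {})
    for option in manifest.get("required", []):
        if option not in module_config:
            problems.append(f"{name}: required setting '{option}' is not set")

    return problems
```

Collecting every problem into a list before reporting, rather than raising on the first one, is one way to give the user all the guidance in a single run.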

pjrobertson and others added 30 commits January 21, 2025 17:53
(two simple helper functions to convert between dot and dict notation)
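
A minimal sketch of what two such helpers could look like, assuming "dot notation" means flat keys like `cli_feeder.urls` and "dict notation" means nested dictionaries. The function names are hypothetical and not taken from the PR.

```python
from typing import Any


def to_dot_notation(nested: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten {'a': {'b': 1}} into {'a.b': 1}."""
    flat: dict[str, Any] = {}
    for key, value in nested.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            # Recurse into non-empty dicts; empty dicts are kept as leaf values.
            flat.update(to_dot_notation(value, dotted))
        else:
            flat[dotted] = value
    return flat


def from_dot_notation(flat: dict[str, Any]) -> dict[str, Any]:
    """Expand {'a.b': 1} into {'a': {'b': 1}}."""
    nested: dict[str, Any] = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested
```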
# Conflicts:
#	src/auto_archiver/databases/__init__.py
# Conflicts:
#	src/auto_archiver/core/orchestrator.py
erinhmclark and others added 27 commits January 24, 2025 18:51
…ig (e.g. cli_feeder.urls

Use 'do_not_store': True in the config settings to apply this. Also: fix up generic archiver dropins loading + local_storage defaults (same as what's in example orchestration)
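
As a rough sketch of how a `'do_not_store': True` marker might be honoured when writing settings back to a file: only the `do_not_store` key comes from the commit message; the settings layout, the module name `example_feeder`, and the function `storable_config` are assumptions for illustration.

```python
from typing import Any

# Hypothetical per-module setting definitions. Only 'do_not_store' is from
# the commit message; everything else here is illustrative.
MODULE_SETTINGS: dict[str, dict[str, dict[str, Any]]] = {
    "example_feeder": {
        "urls": {"default": [], "do_not_store": True},
        "batch_size": {"default": 10},
    }
}


def storable_config(
    values: dict[str, dict[str, Any]],
    settings: dict[str, dict[str, dict[str, Any]]],
) -> dict[str, dict[str, Any]]:
    """Return a copy of `values` without options marked 'do_not_store'."""
    result: dict[str, dict[str, Any]] = {}
    for module, options in values.items():
        definitions = settings.get(module, {})
        kept = {
            name: value
            for name, value in options.items()
            if not definitions.get(name, {}).get("do_not_store", False)
        }
        if kept:  # skip modules whose options were all filtered out
            result[module] = kept
    return result
```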
… isn't installed by default on most machines)
1. Allow loading modules from --module_paths=/extra/path/here
2. Improved unit tests for module loading
3. Further small tidy ups/clean ups
* Add implementation tests for orchestrator + logging tests
* Standardise method/class vars for extractors to see if they are suitable
* Fix bugs with removing default loguru logger (allows further customisation)
* Fix bug loading required fields from file
* Removes (partly) the ArchivingOrchestrator
* Removes the cli_feeder module, and makes it the 'default', allowing you to pass URLs directly on the command line, without having to use the cumbersome --cli_feeder.urls. Just do auto-archiver https://my.url.com
* More unit tests
* Improved error handling
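
The "URLs directly on the command line" behaviour described above could be sketched with `argparse` positional arguments. This is an illustration only, not the project's actual parser; `build_parser` is a hypothetical name, and only the `-s`/`--store` flag and the `auto-archiver https://my.url.com` invocation are taken from the PR text.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser that accepts URLs as bare positional arguments."""
    parser = argparse.ArgumentParser(prog="auto-archiver")
    # Zero or more URLs, passed without any --flag in front of them.
    parser.add_argument("urls", nargs="*", default=[], help="URL(s) to archive")
    # Flag described in the PR: write the current args back to the config file.
    parser.add_argument("-s", "--store", action="store_true",
                        help="store the given arguments in the config file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.urls, args.store)
```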
Context for a specific url/item is now passed around via the metadata (metadata.set_context('key', 'val') and metadata.get_context('key', default='something'))
The only other thing that was passed around in ArchivingContext was the storage info, which is already accessible now via self.config
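
A minimal sketch of the per-item context pattern, assuming the context is a plain dictionary held on the metadata object. The class name `SketchMetadata` and its fields are hypothetical; only the `set_context` / `get_context` call shapes come from the commit message.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchMetadata:
    """Toy stand-in for a metadata object that carries per-item context."""

    url: str
    _context: dict[str, Any] = field(default_factory=dict, repr=False)

    def set_context(self, key: str, val: Any) -> None:
        # Context lives on the item itself, so nothing global is shared.
        self._context[key] = val

    def get_context(self, key: str, default: Any = None) -> Any:
        return self._context.get(key, default)


item = SketchMetadata(url="https://my.url.com")
item.set_context("key", "val")
```

Keeping the context on each item means two URLs processed side by side cannot overwrite each other's values, which is a common motivation for moving away from a shared global context object.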
2 participants