
[EPIC] Rewrite #1

Open
jonblack opened this issue Jan 13, 2016 · 2 comments


jonblack commented Jan 13, 2016

This project is unwieldy:

  • The core update-databanks script is difficult to understand and poorly commented;
  • There are no automated tests. Manual testing involves running the script in an isolated environment and checking the results by hand;
  • It has too many responsibilities. In particular, it is responsible for backing itself up and for updating whynot (see [EPIC] Clarify project boundaries whynot2#15);
  • It uses whynot, which by default is configured to update the production database, making testing doubly difficult.

I propose we rewrite the process of updating and generating the databanks to solve the issues above.

This requires a lot of thought to determine what responsibilities should be in this process and which can be managed by separate services. For example, should generated databanks (e.g. hssp) be a separate service that watches for changes in its dependencies/checks to see what's missing?
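The "checks to see what's missing" idea could be sketched as a simple scan that compares source entries against generated entries. A minimal sketch, assuming a hypothetical flat-directory layout and file extensions (the real databanks layout may differ):

```python
import os

def missing_entries(src_dir, dst_dir, src_ext=".pdb", dst_ext=".hssp"):
    """Return entry IDs present in src_dir but not yet generated in dst_dir.

    Assumes one file per entry, named <id><ext>; this naming scheme is an
    illustration, not the project's actual convention.
    """
    src_ids = {f[: -len(src_ext)] for f in os.listdir(src_dir) if f.endswith(src_ext)}
    dst_ids = {f[: -len(dst_ext)] for f in os.listdir(dst_dir) if f.endswith(dst_ext)}
    return sorted(src_ids - dst_ids)
```

A watcher service could run such a scan periodically (or on inotify events) and enqueue only the missing or stale entries, instead of one monolithic script regenerating everything.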


jonblack commented Nov 14, 2016

I started https://github.com/cmbi/databanks2 after this issue was created, but recently I've been wondering whether it's the right approach. It uses Makefiles, as this repository does (albeit in a much better way).

The problem is that it offers no API, which would be useful for solving issues like cmbi/mrs#44 and #3.


jonblack commented Feb 1, 2017

The current databanks scripts do not scale at all. Distributed storage and processing platforms offer a much nicer way to process PDB and mmCIF files to create the other databanks. The problem is that all we have are a couple of large Supermicro servers rather than many commodity servers. Moreover, the network speed is only 100 Mbps. The market leader in distributed processing is Hadoop with HDFS; however, HDFS works best with fewer large files, whereas the databanks are composed of many small files.

I'm in favour of moving to a distributed platform like Hadoop.
