
See if git can be used to collaboratively track Mastodon projects? #17

Closed
maarzt opened this issue Apr 18, 2023 · 3 comments
Labels
question Further information is requested

Comments


maarzt commented Apr 18, 2023

Is it possible to use git as a backend for mastodon-sc/mastodon-git#12?

Clarify the following questions:

  • Is it possible to version complex file types like Mastodon files with git?
    • Will git merge / git cherry-pick / git rebase produce a corrupted Mastodon dataset, or is there an error message?
    • Is it possible to use a custom merge tool with git?
  • Do we need to be careful about file sizes and repository sizes when versioning Mastodon projects with git?
maarzt added the "question" label Apr 18, 2023

maarzt commented Jul 14, 2023

  • Will git merge / git cherry-pick / git rebase produce a corrupted Mastodon dataset, or is there an error message?

No, they will just complain about conflicting binary files.
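This behavior can be reproduced in a throwaway repository (a sketch; file and branch names are illustrative, and `git init -b` needs git ≥ 2.28):

```shell
set -e
git init -q -b main demo && cd demo
git config user.email you@example.com && git config user.name you

# Commit a binary file (the NUL byte makes git classify it as binary)
printf 'base\0data' > data.bin
git add data.bin && git commit -qm base

# Change the same file differently on two branches
git checkout -qb side
printf 'side\0data' > data.bin && git commit -qam side
git checkout -q main
printf 'main\0data' > data.bin && git commit -qam main

# git does not silently corrupt the file; it stops with
# "warning: Cannot merge binary files" and a CONFLICT on data.bin
git merge side || echo "merge conflict, as expected"
```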

  • Is it possible to use a custom merge tool with git?

Yes, gitattributes can be used to do this. https://stackoverflow.com/questions/12356917/how-to-set-difftool-mergetool-for-a-specific-file-extension-in-git
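As a concrete sketch, a custom merge driver for Mastodon project files could be wired up like this (the `*.mastodon` pattern and the external `mastodon-merge` command are hypothetical assumptions, not an existing tool):

```shell
# Route *.mastodon files to a custom merge driver instead of
# git's built-in text merge (this file is committed with the repo)
echo '*.mastodon merge=mastodon' >> .gitattributes

# Register the driver. %O, %A and %B are the ancestor, ours and
# theirs versions; the driver must write its result to %A and
# exit 0 on success. "mastodon-merge" is a hypothetical tool.
git config merge.mastodon.name "Mastodon project merge"
git config merge.mastodon.driver "mastodon-merge %O %A %B"
```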


maarzt commented Jul 14, 2023

  • Do we need to be careful about file sizes and repository sizes when versioning Mastodon projects with git?

Git is known to perform poorly on large binary files. The biggest Mastodon file I have seen so far is 50 MB, with roughly 400,000 spots. Storing only 1,000 versions of that file without delta compression would produce 50 GB. Making a copy after every 10 added spots would theoretically lead to 1 TB. The recommended size for a git repository is less than 1 GB.

Also, GitHub's maximum file size is 100 MB, so we would soon reach this limit.
There are several solutions for storing large files in git:

  • git LFS - very mature, centralized; supported by GitHub, GitLab and others
  • git-annex - decentralized; meant mostly as an alternative to cloud storage. It poorly supports Windows.
  • dvc - https://dvc.org/ - intended for machine learning

The Mastodon file format is not friendly with regard to delta compression. I did an experiment: I opened a large dataset, saved it to a.mastodon, added a spot, and saved it to b.mastodon. I uncompressed the Mastodon files and compared the model.raw file between the two. More than 100,000 bytes differ between those two files; I would expect maybe 1,000 bytes. My conclusion, also verified by comparing the two files with vbindiff: most bytes that differ between the two files are probably indices and not actual data. That is likely because Mastodon uses ObjectOutputStream to save the graph.
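The byte-level comparison can be repeated with standard tools (a sketch; it assumes the `.mastodon` files are zip containers, matching the uncompressing step described above, and that both `model.raw` files have equal length, which `cmp -l` requires):

```shell
# Unpack the two saved projects
unzip -q a.mastodon -d a
unzip -q b.mastodon -d b

# cmp -l prints one line per differing byte; counting the lines
# gives the number of bytes that changed after adding one spot
cmp -l a/model.raw b/model.raw | wc -l
```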

Conclusion: the Mastodon storage file format could be hugely improved in terms of delta-compression friendliness. Splitting the files into blocks would further reduce the load on git. We would not even need to use git LFS.

Is a specialized Mastodon file format required?

A specialized file format would greatly reduce bandwidth, storage requirements and the need for git LFS. It would probably also improve performance and offline availability.

SVN is another alternative, but it has the drawbacks of being centralized and not offline capable, and it takes a slightly different approach to branching.


maarzt commented Jul 17, 2023

Open questions:

  • Does Mastodon change the spot IDs during write operations? Unfortunately yes. Mastodon rewrites all spot IDs during each writing operation. A small change like deleting one spot can cause most spot IDs to change. Even adding a spot can, under certain circumstances, cause the IDs to change.
  • Write a demo for a merge tool with distributed files. Test git merge, rebase and cherry-pick.
    No experiment was conducted. An idea for an experimental setup would be to version 4 files with git:
    1. a hash sum file that is changed with each commit
    2. a list A
    3. a list B
    4. a file C that contains all entries that are in both list A and list B
      Keeping changes to these files consistent is challenging, as a change to A or B might or might not require a change to file C. Is it possible to set up git merge tools that allow rebase, merge and cherry-pick without breaking the consistency of the files?
  • Write to Oscar and ask him what tools to use.
    Oscar so far has only a limited understanding of the task. He proposed, instead of storing a large file in git, to upload the file somewhere else (ftp / cloud / etc.) and store a link in the readme file of the git repo.
  • Is git better with data that can be easily diffed?
    An experiment was run to test this. Growing spot and link tables were stored in a binary, uncompressed, chunked format. Git was very efficient at storing these tables. Storing the uncompressed tables required 33 MB; the history, containing 3,700 commits, also required only 33 MB (".git" folder size). So the history, even with many commits, was, due to compression, no larger than the original data!
  • Can git use deltas during upload?
    Yes. Git uses deltas during upload, both for text and for binary files.
  • Is git gc faster if run regularly?
    "git gc" is fast and efficient on chunked, binary, uncompressed tables.
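The chunked-table experiment can be sketched roughly as follows (chunk layout, row format and sizes are illustrative, not the exact original setup; `git init -b` needs git ≥ 2.28):

```shell
set -e
git init -q -b main table-demo && cd table-demo
git config user.email you@example.com && git config user.name you

# Grow a table in append-only chunk files, committing after each step;
# earlier chunks keep their exact bytes, so git can pack/delta them well
for step in 1 2 3; do
  for row in $(seq 1 1000); do
    printf 'spot %s %s 1.0 2.0 3.0\n' "$step" "$row" >> "chunk_$step.bin"
  done
  git add -A && git commit -qm "step $step"
done

# Repack the history and inspect its on-disk size;
# "size-pack" is the compressed size of all stored versions
git gc --quiet
git count-objects -vH
```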

@maarzt maarzt closed this as completed Dec 21, 2023