add a fileformat documentation stub

Vector35 · Sep 26, 2024 · ee1889c · ee1889c
1 parent 2077f72
commit ee1889c
Show file tree

Hide file tree

Showing 2 changed files with 95 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -10,6 +10,10 @@ Special thanks to [Willi Ballenthin] and [willem] for IDB format research:
 
 TODO
 
+## Documentation
+
+IDB file format documentation: [fileformat.md](doc/fileformat.md).
+
 ## License
 
 This plugin is released under the Apache-2.0 license.
@@ -20,4 +24,4 @@ Dependency licenses can be found [here](https://nightly.link/Vector35/idb-rs/wor
 
 [Willi Ballenthin]:https://github.com/williballenthin
 [willem]:https://github.com/nlitsme
-[cargo-about]:https://github.com/EmbarkStudios/cargo-about/
+[cargo-about]:https://github.com/EmbarkStudios/cargo-about/
diff --git a/doc/fileformat.md b/doc/fileformat.md
@@ -0,0 +1,90 @@
+# IDB format
+
+The idb format consist mainly of a header with the offsets of its sections.
+
+Known extensions are `*.idb` for 32bits version and `*.i64` for the 64bits version.
+
+NOTE: The `section` word is this doc refer to a section of data of the IDB file, and a `binary-section` is the section of data or the original binary (elf, dll, exe) file.
+
+## File overview
+
+```txt
+IDB File             +-----------------------------------------------------------+
+Start of the file    |[ File Header with offsets for all the sections ][ align  ]|
+Offset for Section A |[ Section a Header | Sections A bytes......................|
+                     |...........................................................|
+End for Section A    |..........................................................]|
+Offset for Section B |[ Section a Header | Sections B bytes......................|
+                     |...........................................................|
+End for Section B    |..........................................................]|
+                     +-----------------------------------------------------------+
+```
+
+
+## Sections
+
+The IDB file contains the following sections:
+
+* ID0: Database with most of the metadata.
+* ID1: Binary data and information about each byte.
+* ID2: Unknown data.
+* NAM: Unknown data.
+* TIL: Database of types from known library.
+* SEG: Unknown data.
+
+Each section include a header with the size of it, so it's possible to ensure that sections don't overlap and once parsing the sections
+all the data is parsed or if it contains left-unparsed data.
+
+
+### ID0
+
+The main database of the project, it contains a list of key and values.
+
+It's stored in a btree format, but if you want care about the parsed ID0, it's just a Vector with each entry being `{key: Vec<u8>, value: Vec<u8>}`,
+the vectors is sorted by key.
+
+It's stored in to btree structure, the sections is divided into pages (usually 0x2000 bytes).
+Each page start will contain 0 or more btree entries, each one being a node (points to other pages) or leaf (points to just data).
+
+Each page have entries at the start and the offset of it's key/value also is relative to it's page, usually stored at the end of the page.
+
+It's possible that some data of this section is not parsed, mostly because deleted data is not removed from the file, it's just left unlinked to btree.
+
+Although the id0 data format is simple and very well understand, the data stored inside id0 can be very complex or unknown.
+
+
+### ID1
+
+The bytes and bytes individual information loaded from the original binary file.
+
+It's store sequentially with a page size of (0x2000, aligned or not depending on the version) and the parsed output is just a list of binary-section.
+Each binary-sections start at a specific offset, have all the raw bytes of the binary-section, it also include 24bits of unknown information for each byte.
+
+It's possible that some data of this section is not parsed, because it's was seing in some examples of extra data stored after all the binary-sections are parsed.
+Although this is possibly some vestigial data from the original binary.
+
+
+### ID2
+
+The contents of this data each format is not known at the time.
+
+
+### NAM
+
+The Nam sections is known to contain a list bytes, what this data means is unknown.
+
+It's unlikely that data is left unparsed, mostly because the entire section is parsed, and any in-between data is enforced to be only zeroes.
+
+
+### TIL
+
+The section contains types/macros informations from external libs, like win32, gcc, libc, etc.
+
+This section is most likely always fully parsed, because any extra data will result into error.
+
+NOTE: All IDA versions include a `til` directory in it's instalation folder with multiple til files, those can be used for testing.
+
+
+### SEG
+
+The contents of this data each format is not known at the time.