Filters now can only be implemented as C code and compiled as part of the main library, they should be:
-
loadable as an external module from the .bp file (via an import declaration)
-
created as a compiled object plugin (loaded with dlopen) or as a script of chosen language, that responds to a precise plugin interface. Compiled objects should obey the same rules than builtin filters in filters/ directory.
Views are merging different parts of a file (or filtered contents) together as a single extent of bytes.
This can typically be used to access contents encapsulated in a transport stream.
A view provides custom backend functions to access ranges of bytes (generalizes the 'struct bitpunch_data_source' type).
BP syntax TBD (probably with one or more builtin functions).
Define a syntax a la XPath to query a set of dpaths from an expression
This will allow those files to be manipulated directly without having to write custom .bp files.
As elements in binaries can be chained structurally, although being logically considered as an ordered set of items, a "list" view would implement array-style access on a chained list of items.
A list would be defined by:
- a "dpath" to the first element
- a "dpath" to the next element from inside an item
- use of already-implemented "last" keyword in the last item
A bitpunch schema file could contain array declarations, e.g. [1,2,3], that could be used along with arrays generated from data.
Same idea for dictionaries (key->value association, where key is a string and value can be any bitpunch object).
That would allow a user to e.g. ask bitpunch what does this 0xC001C0DE bit at offset 0x123456 is for, returning a dpath that represented as a path string would give something like "blocks[42].foo.bar[3].magic", essentially telling that it is "the magic number of 3rd element of array defined by field 'bar' in the structure defined by field 'foo' of the 42th block in the file". That will definitely make his day.
Although it looks scary at first to implement, it should not be too complex considering all the machinery already implemented to track items and their locations. It would consist essentially of browsing structures and arrays recursively, stopping at each level when a match between the browsed offsets and requested offsets is reached. May branch on unions though (keep this for later!) or if the requested range spans multiple items.
Any new language binding is welcome to complete the existing C and Python APIs (C++, JavaScript etc.)
A goal is to have bindings well integrated with the language features, not just mappings to the C API function calls (so it's more work, but it's worth it!).
CLI command + API call to do a complete sanity check of a file (can be implemented by forcing a browse through all structures and gathering errors encountered in a list)
Add ability to search for some sequence of bytes inside a particular set of dpath expressions, and return at which dpath and which offsets they are found.
Add support for regexp matching operator in expressions
For now it's based on plain Makefile, why not but currently it does not even allow to install the software.
For random ideas that came along the way, or more code-specific stuff, see Haystack below or github issues.
Example projects:
-
A hex visualizer/editor that can highlight structure using the bitpunch lib
-
A graphical user interface that can display and extract info in a very interactive way
One of the big upcoming targeted features is adding write capability, so that conformant files can be generated from modified input files or created from scratch from any comprehensive higher level description of their contents.
Of course, the implementation can start with a limited set of supported BP schema features and reject those outside the implemented scope, then augmenting the supported range bit by bit.
A central repository to store and lookup specific format files may be built.
Along with this, the current matching code should be greatly enhanced to be able to fully benefit from the repository of BP files with automatic matching with binary contents (a la "file" unix command)
For now performance tuning has not been a goal for this project, but performance optimization can be considered in the future.
Some algorithmic optimizations have been implemented already where it was thought useful (e.g. bloom filters for efficient search by key).
This is a bit far-fetched, but could be an interesting thing to add. Since the software knows how to browse items, it could also be used to generate browse instructions in a specific language from a higher level browse code relying on the .bp file.
This is a haystack of random features to add to the project. This is NOT an organized or prioritized list.
-
improve error reporting
- notably, add more context information in various places where errors occur, e.g. with bitpunch_error_add_xxx_context() (show expressions from source schema etc.)
-
various structured outputs related to 'xdump' command (e.g. list all fields and their associated hex dump for a given structure, or add a caption line on top of data lines telling which field they represent)
-
implement 'in' keyword, existence test (in indexed arrays, or in all arrays which value type can be compared directly)
-
implement 'flags' interpreter => can display flag's textual representation, and possibly warn about unknown flags
-
implement 'enum' interpreter => can display enum's textual representation, and possibly warn about unknown enum values
-
implement stentil management, in arrays that are or are used as byte arrays through links. This to both benefit from additional structural info (e.g. to limit the available space for a slack container to the first known link pointing later in the array not referring to the same slack container), and to be able to do more sanity checks on boundaries, or to inform about unused space.
-
add support for bit fields
-
CLI
-
add a command to show the target schema of a given dpath expression or type name
-
speed up autocompletion of keys by maintaining a dict of seen keys to show twin index, and using tracker_get_item_key() in place of tracker_get_item_key_multi().
-
display numeric values in a specific format (hex etc.)
-
add dump mode: C initializer
-
-
implement list view (indexed access but list structure on storage)
-
define syntax more precisely
-
should be based on an expression defining the link from an item to the next one, and a stop condition
-
-
implement switch/case statement, as a series of 'if' internally (implementation will be naturally efficient by the dispatch of conditions per block statement)
-
param to choose how sign is encoded in integers (2's complement / sign bit)
-
filter for PCAP files
-
box_get_n_items__slice_generic() can be inefficient (and catch further errors in the array too early) by doing a browse of all items in the array in case the array is slack
- if the array is slack, browsing the slice only should be good (to know for sure the number of valid items)
-
in py.test test files: parse a syntax for data blob description so that actual byte offsets can be registered and tested thereafter
- e.g. "00 01 02 {offset:atX} 03 04" then in test: data.offsets.atX == 3 assert model.get_location(foo) == data.offsets.atX
BUG: filters declared after 'key' statement are not initialized correctly (b_filter is empty)
-
insert type ITEM_ARRAY to distinguish with BYTE_ARRAY after compilation
-
rework handling of field alignment
-
detect size dependencies between packed fields to know their alignment compatibility
-
compile-time errors when alignment is incorrect (unsatisfiable size dependencies)
-
use of proper alignment at runtime when conditionals change alignment
-
-
add properties to arrays, or even arbitrary types:
- e.g. myfield: [size] byte { @minspan: 30; }; myfield: MyStruct { @maxspan: 60; };
-
ability to compare reference to objects with equal/notequal operators
- self == entries['abc']
-
integrate "expr_transform_dpath_internal()" into generic operator evaluation (eval_fn())
-
test index() builtin properly (especially go through expr_dpath_evaluate_filter_type_internal() with CONTAINER dpath type)
-
inline docstrings (ala pydoc)
-
check tests: change .key_type to .key: { .type }, same for .value_type
-
BUG: track path dump of tracker in a slice should substract the slice's start index
-
BUG: mandatory filter attributes are not checked when filter is declared without a block scope {...}
-
BUG: "no match for operator 'add' with operands of type 'any value-type' and 'integer'"
-
modify env() builtin to access the actual environment of the running process, not an artificial environment (previously used to provision data sources)
-
box->scope should be replaced by "ast_node->scope", because the scope is defined at compile time. That will allow resolve phase to work on filter expressions that do not have data context.
-
TEST: add a test similar to test_slack_static_span, with a byte array having a minspan attribute (suspecting issue with ITEMFLAG_FILLS_SLACK being set erroneously)
-
implement array attributes "@minlength" and "@maxlength": replace @length in a similar way than @minspan and @maxspan replace @span. Shortcut can be declaring an array as "[min..max] type";
-> one use case: when specifying trailing bytes like "[..10] byte", or any trailing array with fixed-size items, can help to determine length of preceding slack array by binding its end offset to be inside a range "min_span_end_offset..max_span_end_offset".
-
alias filter's standard input and output as @in and @out attributes, respectively
-> then, redesign the filter read_func() prototype so it does not provide "data" and "size" directly but make the various data sources available to the filter through "filter_evaluate_attribute_internal()" calls, typically passing "@in" to get the main input data source.
-
redesign the internal API around those backend functions to simplify:
- filter.read_value()
- filter.read_func()
- box.get_used_size()
- item.compute_item_size()
-
remove the need to keep ARRAY and BYTE special filter enums
-
replace use of BITPUNCH_NO_ITEM by BITPUNCH_NO_ATTR when looking up an attribute that does not exist (e.g. with filter_evaluate_attribute_internal()), this to avoid mapping different processing types to the same error.
-
implement both endians in compute_item_size__varint() function and add tests for it
-
BUG: "self" expression should not be compiled in resolve phase
-
TODO builtin_filter_declare() should return a bitpunch_status_t