Data comments, documentation improvements, and bug fixes

@johnkerl johnkerl released this 06 Jan 22:51

Features:

  • Comment strings in data files: mlr --skip-comments allows you to filter out input lines starting with #, for all file formats. Likewise, mlr --skip-comments-with X lets you specify the comment-string X. Comments are only supported at the start of a data line. mlr --pass-comments and mlr --pass-comments-with X allow you to forward comments to program output as they are read.

  • The count-similar verb lets you compute cluster sizes by cluster labels.

  • While Miller DSL arithmetic gracefully overflows from 64-bit integer to double-precision float (see also here), there are now the integer-preserving arithmetic operators .+ .- .* ./ .// for those times when you want integer overflow.

  • There is a new bitcount function: for example, echo x=0xf0000206 | mlr put '$y=bitcount($x)' produces x=0xf0000206,y=7.

  • Issue 158: mlr -T is an alias for --nidx --fs tab, and mlr -t is an alias for mlr --tsvlite.

  • The mathematical constants π and e have been renamed from PI and E to M_PI and M_E, respectively. (It's annoying to get a syntax error when you try to define a variable named E in the DSL, when A through D work just fine.) This is a backward incompatibility, but not enough of one to justify calling this release Miller 6.0.0.
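
For readers who want to see the semantics behind three of the features above, here is a minimal Python sketch. It is not Miller's implementation: the function names (filter_comments, auto_overflow_add, int_preserving_add, bitcount) are hypothetical, and the wraparound behavior shown for the dot operators assumes ordinary two's-complement 64-bit arithmetic.

```python
INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1
MASK64 = (1 << 64) - 1


def filter_comments(lines, prefix="#", forward=None):
    """Yield data lines; drop lines starting with prefix.

    If forward is given, comment lines are handed to it as they are read
    (modeling --pass-comments); otherwise they are discarded
    (modeling --skip-comments). Only a prefix at the start of a line counts.
    """
    for line in lines:
        if line.startswith(prefix):
            if forward is not None:
                forward(line)
            continue
        yield line


def wrap_int64(n):
    """Reduce a Python int to a signed 64-bit two's-complement value."""
    n &= MASK64
    return n - (1 << 64) if n > INT64_MAX else n


def auto_overflow_add(a, b):
    """Model of '+': stay integer when the sum fits, else go to float."""
    s = a + b
    return s if INT64_MIN <= s <= INT64_MAX else float(s)


def int_preserving_add(a, b):
    """Model of '.+': always integer, wrapping on overflow (assumption)."""
    return wrap_int64(a + b)


def bitcount(n):
    """Count the 1-bits in the 64-bit representation of n."""
    return bin(n & MASK64).count("1")


if __name__ == "__main__":
    seen = []
    data = list(filter_comments(["# note", "x=1", "y=2 # kept"], forward=seen.append))
    print(data, seen)
    print(auto_overflow_add(INT64_MAX, 1), int_preserving_add(INT64_MAX, 1))
    print(bitcount(0xF0000206))
```

The bitcount call uses the same input as the example in the list above, 0xf0000206, so the two can be compared directly.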

Documentation:

  • As noted here, while Miller has its own DSL there will always be things better expressible in a general-purpose language. The new page Sharing data with other languages shows how to seamlessly share data back and forth between Miller, Ruby, and Python. SQL-input examples and SQL-output examples contain detailed information on the interplay between Miller and SQL.

  • Issue 150 raised a question about suppressing numeric conversion. This resulted in a new FAQ entry How do I suppress numeric conversion?, as well as the longer-term follow-on issue 151 which will make numeric conversion happen on a just-in-time basis.

  • To my surprise, csvlite format options weren’t listed in mlr --help or the manpage. This has been fixed.

  • Documentation for auxiliary commands has been expanded, including within the manpage.

Bugfixes:

  • Issue 159 fixes regex-match of literal dot.

  • Issue 160 fixes out-of-memory cases for huge files. This is an old bug, as old as Miller, and is due to inadequate testing of huge-file cases. The problem is simple: Miller prefers memory-mapped I/O (using mmap) over stdio since mmap is fractionally faster. Yet as any processing (even mlr cat) steps through an input file, more and more pages are faulted in -- and, unfortunately, previous pages are not paged out once memory pressure increases. (This despite gallant attempts with madvise.) Once all processing is done, the memory is released; there is no leak per se. But the Miller process can crash before the entire file is read. The solution is equally simple: to prefer stdio over mmap for files over 4GB in size. (This 4GB threshold is tunable via the --mmap-below flag as described in the manpage.)

  • Issue 161 fixes a CSV-parse error (with error message "unwrapped double quote at line 0") when a CSV file starts with the UTF-8 byte-order-mark ("BOM") sequence 0xef 0xbb 0xbf and the header line has double-quoted fields. (Release 5.2.0 introduced handling for UTF-8 BOMs, but missed the case of a double-quoted header line.)

  • Issue 162 fixes a corner case doing multi-emit of aggregate variables when the first variable name is a typo.

  • The Miller JSON parser used to error with Unable to parse JSON data: Line 1 column 0: Unexpected 0x00 when seeking value on empty input, or input with trailing whitespace; this has been fixed.
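
The ideas behind the fixes for issues 160 and 161 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Miller's actual code: the names choose_reader, strip_utf8_bom, and read_csv_header are hypothetical, the 4GB threshold is taken to mean 4 GiB, and a file exactly at the threshold is assumed to use stdio.

```python
import csv
import io

UTF8_BOM = b"\xef\xbb\xbf"
DEFAULT_MMAP_BELOW = 4 * 1024**3  # assumption: "4GB" read as 4 GiB


def choose_reader(file_size, mmap_below=DEFAULT_MMAP_BELOW):
    """Pick an I/O strategy by file size: mmap for small files, stdio otherwise."""
    return "mmap" if file_size < mmap_below else "stdio"


def strip_utf8_bom(data):
    """Drop a leading UTF-8 byte-order mark, if present."""
    return data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data


def read_csv_header(data):
    """Parse the header line of CSV bytes, tolerating a BOM before a quoted field."""
    text = strip_utf8_bom(data).decode("utf-8")
    return next(csv.reader(io.StringIO(text)))


if __name__ == "__main__":
    print(choose_reader(1024), choose_reader(5 * 1024**3))
    print(read_csv_header(b'\xef\xbb\xbf"a","b"\n1,2\n'))
```

The point of the BOM case is ordering: the three BOM bytes have to be removed before the parser looks for an opening double quote, or the first header field no longer appears to start with one.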

There is no prebuilt Windows executable for this release; my apologies.