Data comments, documentation improvements, and bug fixes
Features:
-
Comment strings in data files: mlr --skip-comments allows you to filter out input lines starting with #, for all file formats. Likewise, mlr --skip-comments-with X lets you specify the comment-string X. Comments are only supported at the start of a data line. mlr --pass-comments and mlr --pass-comments-with X allow you to forward comments to program output as they are read.
-
The count-similar verb lets you compute cluster sizes by cluster labels.
-
While Miller DSL arithmetic gracefully overflows from 64-bit integer to double-precision float (see also here), there are now the integer-preserving arithmetic operators .+ .- .* ./ .// for those times when you want integer overflow.
-
There is a new bitcount function: for example, echo x=0xf0000206 | mlr put '$y=bitcount($x)' produces x=0xf0000206,y=7.
-
Issue 158: mlr -T is an alias for --nidx --fs tab, and mlr -t is an alias for mlr --tsvlite.
-
The mathematical constants π and e have been renamed from PI and E to M_PI and M_E, respectively. (It's annoying to get a syntax error when you try to define a variable named E in the DSL, when A through D work just fine.) This is a backward incompatibility, but not enough of one to justify calling this release Miller 6.0.0.
Documentation:
-
As noted here, while Miller has its own DSL there will always be things better expressible in a general-purpose language. The new page Sharing data with other languages shows how to seamlessly share data back and forth between Miller, Ruby, and Python. SQL-input examples and SQL-output examples contain detailed information on the interplay between Miller and SQL.
-
Issue 150 raised a question about suppressing numeric conversion. This resulted in a new FAQ entry How do I suppress numeric conversion?, as well as the longer-term follow-on issue 151 which will make numeric conversion happen on a just-in-time basis.
-
To my surprise, csvlite format options weren’t listed in mlr --help or the manpage. This has been fixed.
-
Documentation for auxiliary commands has been expanded, including within the manpage.
Bugfixes:
-
Issue 159 fixes regex-match of literal dot.
-
Issue 160 fixes out-of-memory cases for huge files. This is an old bug, as old as Miller, and is due to inadequate testing of huge-file cases. The problem is simple: Miller prefers memory-mapped I/O (using mmap) over stdio since mmap is fractionally faster. Yet as any processing (even mlr cat) steps through an input file, more and more pages are faulted in -- and, unfortunately, previous pages are not paged out once memory pressure increases. (This despite gallant attempts with madvise.) Once all processing is done, the memory is released; there is no leak per se. But the Miller process can crash before the entire file is read. The solution is equally simple: to prefer stdio over mmap for files over 4GB in size. (This 4GB threshold is tunable via the --mmap-below flag as described in the manpage.)
-
Issue 161 fixes a CSV-parse error (with error message "unwrapped double quote at line 0") when a CSV file starts with the UTF-8 byte-order-mark ("BOM") sequence 0xef 0xbb 0xbf and the header line has double-quoted fields. (Release 5.2.0 introduced handling for UTF-8 BOMs, but missed the case of double-quoted header line.)
-
Issue 162 fixes a corner case doing multi-emit of aggregate variables when the first variable name is a typo.
-
The Miller JSON parser used to error with "Unable to parse JSON data: Line 1 column 0: Unexpected 0x00 when seeking value" on empty input, or input with trailing whitespace; this has been fixed.
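Two of the fixes above, the size-based choice between mmap and stdio (issue 160) and BOM handling ahead of a double-quoted CSV header (issue 161), can be sketched in Python. This is a rough illustration under stated assumptions, not Miller's C code: the names read_lines, parse_csv_header, and MMAP_BELOW are hypothetical, and the default threshold simply mirrors the 4GB figure mentioned above.

```python
import codecs
import csv
import mmap
import os

MMAP_BELOW = 4 * 1024**3  # hypothetical default mirroring the 4GB threshold


def read_lines(path, mmap_below=MMAP_BELOW):
    """Yield raw lines, memory-mapping small files and streaming large ones."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if 0 < size < mmap_below:
            # Small file: map it into memory and walk it line by line.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                for line in iter(m.readline, b""):
                    yield line
        else:
            # Large (or empty) file: buffered reads keep memory use bounded.
            for line in f:
                yield line


def parse_csv_header(first_line):
    """Strip a leading UTF-8 BOM, then parse the header with quote handling."""
    if first_line.startswith(codecs.BOM_UTF8):
        first_line = first_line[len(codecs.BOM_UTF8):]
    return next(csv.reader([first_line.decode("utf-8")]))


print(parse_csv_header(codecs.BOM_UTF8 + b'"a","b"\n'))  # ['a', 'b']
```

Stripping the BOM before the quote-aware parse is the point of the second helper: otherwise the first double quote is not at the start of the field, which is the kind of condition that produces an "unwrapped double quote" style of error.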
There is no prebuilt Windows executable for this release; my apologies.