Releases: h2oai/datatable
v0.4.0 — 2018-05-07
Added
- Fread now parses integers with thousands separator (e.g. "1,000").
- Added option `fread.anonymize` which forces fread to anonymize all user input in the verbose logs / error messages.
- Allow type-casts from booleans / integers / floats into strings.
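The thousands-separator rule above can be illustrated with a small plain-Python sketch. It is not fread's implementation: the helper name `parse_grouped_int` and the exact acceptance rule (optional sign, commas only between groups of three digits) are assumptions made for this example.

```python
import re

# Assumed rule: either a plain run of digits, or 1-3 leading digits followed
# by one or more ",ddd" groups. An optional sign is allowed in both cases.
_GROUPED_INT = re.compile(r"[+-]?(\d{1,3}(,\d{3})+|\d+)")

def parse_grouped_int(field):
    """Return the integer value of `field`, or None if it does not qualify."""
    text = field.strip()
    if not _GROUPED_INT.fullmatch(text):
        return None
    return int(text.replace(",", ""))
```

A field such as "12,34" is rejected here because its comma does not separate a group of three digits, so it would stay a string under this assumed rule.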
Release 0.3.2
Added
- Implemented sorting for `str64` columns.
- write_csv can now write columns of type `str64`.
- Fread can now accept a list of files to read, or a glob pattern.
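As a rough sketch of what accepting "a list of files, or a glob pattern" involves, the helper below normalizes all three input shapes into a list of paths using the standard `glob` module. The function name `resolve_sources` and the sorted ordering are assumptions for illustration only; the source does not say how fread orders or combines the matched files.

```python
import glob

def resolve_sources(source):
    """Expand a path, a glob pattern, or a list of paths into a list of paths."""
    if isinstance(source, (list, tuple)):
        return [str(path) for path in source]
    if any(ch in source for ch in "*?["):
        # Treat the string as a glob pattern; sort for a deterministic order.
        return sorted(glob.glob(source))
    return [source]
```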
Fixed
- Fix the source distribution (`sdist`) by including all the files that are required for building from source.
- Install no longer fails with `llvmlite 0.23.0` package.
Release 0.3.1
Added
- Added ability to delete rows from a view Frame.
- Implement countna() function for `obj64` columns.
- New option `dt.options.core_logger` to help debug datatable.
- New Frame method `.materialize()` to convert a view Frame into a "real" one. This method is noop if applied to a non-view Frame.
- Several internal options to fine-tune the performance of sorting algorithm.
- Significantly improved performance of sorting doubles.
- fread can now read string columns that are larger than 2GB in size.
- fread can now accept a list/tuple of stypes for its `columns` parameter.
- improved logic for auto-assigning column names when they are missing.
- fread now supports reading files that contain NUL characters.
- Added global settings `options.frame.names_auto_index` and `options.frame.names_auto_prefix` to control automatic column name generation in a Frame.
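The automatic column naming mentioned above can be pictured with the following sketch, where `prefix` and `start_index` stand in for the two new settings. This is not datatable's code: the function name, and the rule of skipping generated names that collide with user-supplied ones, are assumptions.

```python
def fill_missing_names(names, prefix="C", start_index=0):
    """Replace missing (None or empty) names with prefix + a running index."""
    taken = {name for name in names if name}
    result = []
    index = start_index
    for name in names:
        if not name:
            # Assumed rule: skip generated names already used by the user.
            while f"{prefix}{index}" in taken:
                index += 1
            name = f"{prefix}{index}"
            taken.add(name)
            index += 1
        result.append(name)
    return result
```

For example, `fill_missing_names([None, "x", ""])` yields `["C0", "x", "C1"]` under these assumptions.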
Changed
- When creating a column of "object" type, we will now coerce float "nan" values into `None`s.
- Renamed fread's parameter `strip_white` into `strip_whitespace`.
- Eliminated all `assert()` statements from C code, and replaced them with exception throws.
- Default column names, if none given by the user, are "C0", "C1", "C2", ... for both `fread` and `Frame` constructor.
- function-valued `columns` parameter in fread has been changed: if previously the function was invoked for every column, now it receives the list of all columns at once, and is expected to return a modified list (or dict / set / etc). Each column description in the list that the function receives carries the column's name and stype; in the future a `format` field will also be added.
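The new calling convention for a function-valued `columns` parameter might look like the sketch below. `ColumnInfo` is a hypothetical stand-in for the column descriptions (the source only says they carry a name and an stype), the stype strings are placeholders, and returning `None` to drop a column is an assumption rather than documented behavior.

```python
from collections import namedtuple

# Hypothetical stand-in for the descriptions handed to the callback.
ColumnInfo = namedtuple("ColumnInfo", ["name", "stype"])

def keep_numeric_upper(columns):
    """Receive all columns at once and return one entry per column.

    Numeric columns are renamed to upper case; other columns are marked
    with None (assumed here to mean "skip this column").
    """
    return [
        col.name.upper() if col.stype in ("int32", "float64") else None
        for col in columns
    ]

detected = [
    ColumnInfo("id", "int32"),
    ColumnInfo("note", "str32"),
    ColumnInfo("price", "float64"),
]
selection = keep_numeric_upper(detected)
```

Because the function sees the whole list, it can make decisions that depend on several columns together, which the old per-column invocation could not do.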
Fixed
- fread will no longer consume excessive amounts of memory when reading a file with too many columns and few rows.
- fixed a possible crash when reading CSV file containing long string fields.
- fread: NA fields with whitespace were not recognized correctly.
- fread will no longer emit error messages or type-bump variables due to incorrectly recognized chunk boundaries.
- Fixed a crash when rbinding string column with non-string: now an exception will be thrown instead.
- Calling any stats function on a column of obj64 type will no longer result in a crash.
- Columns/rows slices no longer fail on an empty Frame.
- Fixed crash when materializing a view frame containing obj64 columns.
- Fixed erroneous grouping calculations.
- Fixed sorting of 1-row view frames.
Initial alpha release
Added
- Method `df.tonumpy()` now has argument `stype` which will force conversion into a numpy array of the specific stype.
- Enums `stype` and `ltype` that encapsulate the type-system of the `datatable` module.
- It is now possible to fread from a `bytes` object.
- Allow columns to be renamed by setting the `names` property on the datatable.
- Internal "MemoryMapManager" will make datatable more robust when opening a frame with many columns on Linux systems. In particular, error 12 "not enough memory" should become much more rare now.
- Number of threads used by fread can now be controlled via parameter `nthreads`.
- It is now possible to supply string argument to `dt.DataTable` constructor, which in turn will try to interpret that argument via `fread`.
- `fread` can now read compressed `.xz` files.
- `fread` now automatically skips Ctrl+Z / NUL characters at the end of the file.
- It is now possible to create a datatable from string numpy array.
- Added parameters `skip_blank_lines`, `strip_white`, `quotechar` and `dec` to fread.
- Single-column files with blank lines can now be read successfully.
- Fread now recognizes \r\r\n as a valid line ending.
- Added parameters `url` and `cmd` to `fread`, as well as ability to detect URLs automatically. The `url` parameter downloads file from HTTP/HTTPS/FTP server into a temporary location and reads it from there. The `cmd` parameter executes the provided shell command and then reads the data from the stdout.
- It is now possible to pass `file` objects to `fread` (or any objects exposing method `read()`).
- File path given to `fread` can now transparently select files within .zip archives. This doesn't work with archives-within-archives.
- GenericReader now supports auto-detecting and reading UTF-16 files.
- GenericReader now attempts to detect whether the input file is an HTML, and if so raises an exception with the appropriate error message.
- Datatable can now use either llvm-4.0 or llvm-5.0 depending on what the user has.
- fread now allows `sep=""`, causing the file to be read line-by-line.
- `range` arguments can now be passed to a DataTable constructor.
- datatable will now fall back to eager execution if it cannot detect LLVM runtime.
- simple Excel file reader.
- It is now possible to select columns from DataTable by type: `df[int]` selects all integer columns from `df`.
- Allow creating DataTable from list, while forcing a specific stype(s).
- Added ability to delete rows from a DataTable: `del df[rows, :]`
- DataTable can now accept pandas/numpy frames with columns of float16 dtype (which will be automatically converted to float32).
- .isna() function now works on strings too.
- `.save()` is now a method of `Frame` class.
- Warnings now have custom display hook.
- Added global option `nthreads` which controls the number of OpenMP threads used by `datatable` for parallel execution. Example: `dt.options.nthreads = 1`.
- Add method `.scalar()` to quickly convert a 1x1 Frame into a python scalar.
- New methods `.min1()`, `.max1()`, `.mean1()`, `.sum1()`, `.sd1()`, `.countna1()` that are similar to `.min()`, `.max()`, etc. but return a scalar instead of a Frame (however they only work with 1-column Frames).
- Implemented method `.nunique()` to compute the number of unique values in each column.
- Added stats functions `.mode()` and `.nmodal()`.
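To make the last few statistics concrete, here is a plain-Python sketch of what `.nunique()`, `.mode()` and `.nmodal()` could compute for a single column. It does not use datatable. Treating `nmodal` as the number of occurrences of the mode, and ignoring `None` as an NA marker, are assumptions made for this example.

```python
from collections import Counter

def column_stats(values):
    """Return (nunique, mode, nmodal) for one column, ignoring None values."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return 0, None, 0
    # most_common(1) returns the value with the highest count and that count.
    mode, nmodal = counts.most_common(1)[0]
    return len(counts), mode, nmodal
```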
Changed
- When writing "round" doubles/floats to CSV, they'll now always have trailing zero. For example, [0.0, 1.0, 1e23] now produce "0.0,1.0,1.0e+23" instead of "0,1,1e+23".
- `df.stypes` now returns a tuple of `stype` elements (previously it was returning a list of strings). Likewise, `df.types` was renamed into `df.ltypes` and now it returns a tuple of `ltype` elements instead of strings.
- Parameter `colnames=` in DataTable constructor was renamed to `names=`. The old parameter may still be used, but it will result in a warning.
- DataTable can no longer have duplicate column names. If such names are given, they will be mangled to make them unique, and a warning will be issued.
- Special characters (in the ASCII range `\x00 - \x1F`) are no longer permitted in the column names. If encountered, they will be replaced with a dot.
- Fread now ignores trailing whitespace on each line, even if ' ' separator is used.
- Fread on an empty file now produces an empty DataTable, instead of an exception.
- Fread's parameter `skip_lines` was replaced with `skip_to_line`, so that it's more in sync with the similar argument `skip_to_string`.
- When saving datatable containing "obj64" columns, they will no longer be saved, and user warning will be shown (previously saving this column would eventually lead to a segfault).
- (python) DataTable class was renamed into Frame.
- "eager" evaluation engine is now the default.
- Parameter `inplace` of method `rbind()` was removed: instead you can now rbind frames to an empty frame: `dt.Frame().rbind(df1, df2)`.
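The trailing-zero rule for "round" doubles from the first item in this list can be reproduced with a short formatter. This sketch relies on Python's `repr` for the digits and only patches in the missing ".0"; it is an illustration of the output format, not the CSV writer's actual algorithm, and `format_double` is a made-up name.

```python
def format_double(x):
    """Format a float so that "round" values keep a trailing ".0"."""
    text = repr(float(x))
    mantissa, sep, exponent = text.partition("e")
    # Only touch purely integral mantissas such as "1" in "1e+23";
    # "inf" and "nan" are left unchanged.
    if "." not in mantissa and mantissa.lstrip("-").isdigit():
        mantissa += ".0"
    return mantissa + sep + exponent

line = ",".join(format_double(v) for v in [0.0, 1.0, 1e23])
```

With these inputs `line` is "0.0,1.0,1.0e+23", matching the example given in the changelog entry.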
Fixed
- `datatable` will no longer cause the C locale settings to change upon importing.
- reading a csv file with invalid UTF-8 characters in column names will no longer throw an exception.
- creating a DataTable from pandas.Series with explicit `colnames` will no longer ignore those column names.
- fread(fill=True) will correctly fill missing fields with NAs.
- fread(columns=set(...)) will correctly handle the case when the input contains multiple columns with the same names.
- fread will no longer crash if the input dataset contains invalid utf8/win1252 data in the column headers (#594, #628).
- fixed bug in exception handling, which occasionally caused empty exception messages.
- fixed bug in fread where string fields starting with "NaN" caused an assertion error.
- Fixed bug when saving a DataTable with unicode column names into .nff format on systems where default encoding is not unicode-aware.
- More robust newline handling in fread (#634, #641, #647).
- Quoted fields are now correctly unquoted in fread.
- Fixed a bug in fread which occurred if the number of rows in the CSV file was estimated too low (#664).
- Fixed fread bug where an invalid DataTable was constructed if parameter `max_nrows` was used and there were any string columns (#671).
- Fixed a rare bug in fread which produced error message "Jump X did not finish reading where jump X+1 started" (#682).
- Prevented memory leak when using "PyObject" columns in conjunction with numpy.
- View frames can now be properly saved.
- Fixed crash when sorting view frame by a string column.
- Deleting 0 columns is no longer an error.
- Rows filter now works properly when applied to a view table and using "eager" evaluation engine.
- Computed columns expression can now be combined with rows expression, or applied to a view Frame.
Pre-release alpha v0.2.2
Added
- Ability to write DataTable into a CSV file: the `.to_csv()` method. The CSV writer is multi-threaded and extremely fast.
- Added `.internal.column(i).data_pointer` getter, to allow native code from other libraries to easily access the data in each column.
- Fread can now read hexadecimal floating-point numbers: floats and doubles.
- Csv writer will now auto-quote an empty string, and a string containing leading/trailing whitespace, so that it can be read by `fread` reliably.
- Fread now prints file sizes in "human-readable" form, i.e. KB/MB/GB instead of bytes.
- Fread can now understand a variety of "NaN" / "Inf" literals produced by different systems.
- Add option `hex` to csv writer, which controls whether floats will be written in decimal (default) or hexadecimal format.
- Csv writer now uses the "dragonfly" algorithm for writing doubles, which is faster than all known alternatives.
- It is now allowed to pass a single-row numpy array as an argument to `dt(rows=...)`, which will be treated the same as if it was a single-column array.
- Now `datatable`'s wheel will include libraries `libomp` and `libc++` on the platforms where they are not widely available.
- New `fread`'s argument `logger` allows the user to supply custom logging mechanism to fread. When this argument is provided, "verbose" mode is turned on automatically.
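Hexadecimal floating-point notation, mentioned above for both the reader and the writer's `hex` option, stores the binary value of a float exactly, so it round-trips without rounding error. The sketch below shows the idea with Python's built-in `float.hex` and `float.fromhex`. Whether datatable's hexadecimal format matches Python's character for character is an assumption, and `parse_float_field` is a hypothetical helper.

```python
def parse_float_field(text):
    """Parse a CSV field written either in decimal or in hexadecimal form."""
    text = text.strip()
    body = text.lstrip("+-").lower()
    if body.startswith("0x"):
        # Hexadecimal form, e.g. "0x1.8p+1" means 1.5 * 2**1.
        return float.fromhex(text)
    return float(text)

# Writing a value in hex and reading it back recovers exactly the same double.
roundtrip = parse_float_field((0.1).hex())
```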
Changed
- `datatable` will no longer attempt to distinguish between NA and NAN floating-point values.
- Constructing DataTable from a 2D numpy array now preserves shape of that array. At the same time it is no longer true that `arr.tolist() == numpy.array(DataTable(arr)).tolist()`: the list will be transposed.
- Converting a DataTable into a numpy array now also preserves shape. At the same time it is no longer true that `dt.topython() == dt.tonumpy().tolist()`: the list will be transposed.
- The internal `_datatable` module was moved to `datatable.lib._datatable`.
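The transposition described in the shape-preservation entries comes from the difference between a row-wise nested list (what a 2D array's `tolist()` produces) and a column-wise one. The sketch below uses plain lists rather than numpy or datatable; that `topython()` returns a list of columns is an assumption inferred from the entries above.

```python
def rows_to_columns(rows):
    """Transpose a rectangular list of rows into a list of columns."""
    return [list(column) for column in zip(*rows)]

# A 2x3 table as a list of rows, as a 2D array's tolist() would give it.
as_rows = [[1, 2, 3], [4, 5, 6]]

# The same data as a list of columns (assumed layout of a column-wise export).
as_columns = rows_to_columns(as_rows)
```

Applying `rows_to_columns` a second time restores the row-wise form, which is why the two representations compare equal only after a transpose.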
Fixed
- `datatable` will now convert huge integers into double `inf` values instead of raising an exception.
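A minimal sketch of the behavior described in this fix, written in plain Python: an integer too large for a double becomes infinity instead of raising. The helper name is made up, and mapping negative huge integers to negative infinity is an assumption, since the entry only mentions `inf`.

```python
import math

def int_to_double(n):
    """Convert an int to a float, saturating to infinity on overflow."""
    try:
        return float(n)
    except OverflowError:
        # Assumption: the sign of the input carries over to the infinity.
        return math.inf if n > 0 else -math.inf
```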
Pre-release alpha v0.2.1
Added
- Environmental variable `DTNOOPENMP` will cause the `datatable` to be built without OpenMP support.
- If `d0` is a DataTable, then `d1 = DataTable(d0)` will create its shallow copy.
- In addition to LLVM4 environmental variable, datatable will now also look for the `llvm4` folder within the package's directory.
- Getter `df.internal.rowindex` allows access to the RowIndex on the DataTable (for inspection / reuse).
- Implemented statistics `min`, `max`, `mean`, `stdev`, `countna` for numeric and boolean columns.
- A framework for computing and storing per-column summary statistics.
- `sys.getsizeof(dt)` can now be used to query the size of the datatable in memory.
- This CHANGELOG file.
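For readers unfamiliar with NA-aware statistics, the sketch below computes the five statistics named above for one column, with `None` standing in for NA. It is an illustration in plain Python, not datatable's implementation; in particular, using the sample standard deviation (dividing by n - 1) is an assumption.

```python
import math

def summarize(values):
    """Return min, max, mean, stdev and countna for one column (None = NA)."""
    present = [v for v in values if v is not None]
    stats = {"min": None, "max": None, "mean": None, "stdev": None,
             "countna": len(values) - len(present)}
    if not present:
        return stats
    mean = sum(present) / len(present)
    stats.update(min=min(present), max=max(present), mean=mean)
    if len(present) > 1:
        # Assumed: sample standard deviation with n - 1 in the denominator.
        variance = sum((v - mean) ** 2 for v in present) / (len(present) - 1)
        stats["stdev"] = math.sqrt(variance)
    return stats
```

Boolean columns work with the same code because Python treats `True` and `False` as 1 and 0 in arithmetic.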
Fixed
- Filter function when applied to a view DataTable now produces correct result.
Pre-release alpha
Starting from this point on, all substantially new functionality will be recorded in the CHANGELOG file.