World's Fastest SAS7BDAT Parser

Benchmarks

Based on some public and private benchmarks, this library reaches roughly 300–500 MB/s on a single M1 MacBook Pro core.

| Library | Times slower | Correctness |
| --- | --- | --- |
| This library | 1x | (See roadmap) |
| cpp-sas7bdat | 3x | ? |
| pandas.read_sas in Pandas 2.0 | 10–20x | Less correct |
| pandas.read_sas in Pandas 1.5 | 15–30x | Less correct |
| pandas.read_sas in Pandas 1.4 | n/a | Broken |
| pyreadstat | 5–10x, sometimes > 100x | More correct |
| sas7bdat.py | > 100x | ? |
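
To reproduce numbers of this kind on your own files, a throughput figure can be taken as file size divided by the best wall-clock parse time, and "times slower" as the ratio of two throughputs. The sketch below is one possible way to measure this, not the harness used for the figures above; the helper names `throughput_mb_per_s` and `times_slower` are hypothetical, and `reader` is any callable that takes a file path.

```python
import os
import time


def throughput_mb_per_s(path, reader, repeats=3):
    """Best-of-N parse throughput in MB/s (1 MB = 10**6 bytes).

    `reader` is any callable taking a file path, for example a read_sas
    function. Hypothetical helper, not part of this library.
    """
    size_mb = os.path.getsize(path) / 1e6
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path)
        best = min(best, time.perf_counter() - start)
    # Guard against a timer reading of exactly zero on very small files.
    return size_mb / best if best > 0 else float("inf")


def times_slower(baseline_mb_s, other_mb_s):
    """How many times slower `other` is than `baseline`."""
    return baseline_mb_s / other_mb_s
```

Taking the best of several runs reduces noise from disk caching and other processes; results will still vary with the file's contents and the machine.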

Usage

Currently only the Pandas compatibility interface is considered stable:

import sas7bdat.pandas_compat

df = sas7bdat.pandas_compat.read_sas("myfile.sas7bdat")

Options to read_sas are the same as in pandas.read_sas.
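
Because the options match, code can treat the two functions as interchangeable. The sketch below picks this library's `read_sas` when it is importable and falls back to `pandas.read_sas` otherwise. The helper `pick_reader` and the `CANDIDATES` tuple are hypothetical names introduced for illustration; only the two module paths come from this README and from pandas.

```python
import importlib

# (module, attribute) pairs tried in order: this library's documented entry
# point first, then the pandas function it mirrors.
CANDIDATES = (
    ("sas7bdat.pandas_compat", "read_sas"),
    ("pandas", "read_sas"),
)


def pick_reader(candidates=CANDIDATES):
    """Return (module_name, callable) for the first importable reader."""
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # not installed, try the next candidate
        reader = getattr(module, attr, None)
        if callable(reader):
            return module_name, reader
    raise ImportError("no SAS reader is available")


# Example use (requires one of the candidates to be installed):
#   name, read_sas = pick_reader()
#   df = read_sas("myfile.sas7bdat", encoding="latin-1")
```

`encoding` is one of the keyword options of `pandas.read_sas`; whether every option behaves identically here should be checked against your own files.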

Installation

cd python
python setup.py install

Roadmap

  • Fix all the bugs
  • Parser features:
    • Limiting the number of rows and pages to read
    • Efficient row and page skipping (useful for parallel reading)
    • Reading only a subset of columns
    • Automatically converting to best number type (e.g., read integers instead of doubles)
  • Pandas features:
  • New parsers:
    • Parsing directly to (Py)Arrow
    • Parsing directly to Parquet
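
Until the parser converts number types itself, the "best number type" idea can be approximated after loading. The sketch below shows the rule on a plain Python list: a column becomes integers only if every value is a finite whole number, otherwise it is left as floats. The function name `to_best_number_type` is hypothetical and this is not the planned implementation.

```python
import math


def to_best_number_type(values):
    """Return ints if every value is a finite whole number, else floats.

    Hypothetical post-processing helper. A column containing NaN (for
    example from missing values) or any fractional value stays as floats.
    """
    if all(math.isfinite(v) and float(v).is_integer() for v in values):
        return [int(v) for v in values]
    return list(values)
```

The conversion is all-or-nothing per column, so no value changes meaning: a single fractional or missing entry keeps the whole column as doubles.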

License

Permission to use, copy, modify, and/or distribute this software for any non-commercial purpose without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.