This is a C++17 implementation of a SAS7BDAT file reader. The project also provides Python and R interfaces.
This is a toy project exploring CMake and C++ external polymorphism.
```bash
make conan-install  # Download and install conan inside your python environment
make                # Compile with cmake and conan
make tests          # Invoke ctest
make tests-python   # Build the python package and execute the tests
make tests-R        # Build the R package and execute the tests
make conan-package  # Create the conan package and install it locally
make benchmark      # Execute the benchmark
make clean          # Clean the build directory
```
A SAS7BDAT file is a database storage file created by Statistical Analysis System (SAS) software. It contains a binary encoded dataset used for advanced analytics, business intelligence, data management, predictive analytics, and more. (from fileinfo.com)
Several projects already exist to read SAS7BDAT files. This project is based on the following implementations:
- https://github.com/WizardMac/ReadStat/
- https://github.com/tk3369/SASLib.jl
- https://pypi.org/project/sas7bdat/
This design pattern is very nicely explained in a talk by Klaus Iglberger, Breaking Dependencies: Type Erasure - A Design Analysis (https://www.youtube.com/watch?v=7GIz9SmRgyc).
The idea was presented in September 1998 in a "C++ Report" paper by Chris Cleeland and Douglas C. Schmidt introducing the external polymorphism pattern (https://www.dre.vanderbilt.edu/~schmidt/PDF/C++-EP.pdf).
This pattern allows classes that are not related by inheritance and/or have no virtual methods to be treated polymorphically. It combines C++ language features with patterns like Adapter and Decorator to give the appearance of polymorphic behavior on otherwise unrelated classes.
A functionality `Foo` uses a concept `X` with different methods, i.e. `foo(...)`. The concept `X` can be seen as an interface that a concrete implementation needs to fulfill. The concrete implementation doesn't have to derive from the interface; the link between the two is done via the model class. Please note that this model class is called adapter in the original paper.
One of the main advantages of this pattern is the complete isolation of the concrete implementations: as they are not tied to a base class, they only have to expose a set of methods. See, for example, Sean Parent's talk Inheritance Is the Base Class of Evil.
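To make the mechanics concrete, here is a minimal sketch of the pattern using the `Foo`/`X`/`foo` names from the description above; the class names `XConcept` and `XModel` are illustrative, not the package's actual code:

```cpp
#include <memory>
#include <utility>

// The "concept": an internal interface describing what Foo needs from X.
struct XConcept {
  virtual ~XConcept() = default;
  virtual void foo() const = 0;
};

// The "model" (called adapter in the original paper): wraps any type that
// provides a foo() method, without that type deriving from anything.
template <typename T>
struct XModel final : XConcept {
  explicit XModel(T _t) : t(std::move(_t)) {}
  void foo() const override { t.foo(); }
  T t;
};

// The functionality: stores the wrapped object and uses it polymorphically.
class Foo {
public:
  template <typename T>
  explicit Foo(T _t) : x(std::make_unique<XModel<T>>(std::move(_t))) {}
  void run() const { x->foo(); }

private:
  std::unique_ptr<XConcept> x;
};

// Any unrelated class exposing foo() can now be passed to Foo:
struct MyImpl {
  void foo() const { /* ... */ }
};
// Foo f{MyImpl{}}; f.run();
```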
This pattern is used at different levels within this package:
- data source
- dataset sink
- selection/filtering of the dataset's columns
- dataset's column formatters
A data source based on `std::ifstream` is provided in this package: `cppsas7bdat::datasource::ifstream`.
3 simple dataset sinks are provided in this package. The first one directly prints the content of the file (header and data) to the screen, and the second one is a very basic csv writer (no field protection besides the double quotes, no encoding, ...), as sketched below.
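As a rough illustration of what "field protection with double quotes" means here, the quoting logic amounts to something like the following sketch (hypothetical code, not the package's actual implementation):

```cpp
#include <string>

// Minimal CSV field protection: wrap the field in double quotes and double
// any embedded quote; no character encoding or further escaping is done.
std::string quote_csv_field(const std::string& _field) {
  std::string out{'"'};
  for (const char c : _field) {
    if (c == '"') out += '"';  // escape an embedded quote by doubling it
    out += c;
  }
  out += '"';
  return out;
}
```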
The package provides several filtering options; for instance, specific columns can be included or excluded via the `include` and `exclude` parameters shown in the examples below.
Each column has a specific type and conversion/format operators. External polymorphism is used internally to store the exact operator for each column, taking into account endianness, 32/64-bit layouts, etc.
Supported types:
- string (`std::string`)
- integer (`long`)
- number (`double`)
- datetime (`boost::posix_time::ptime`)
- date (`boost::gregorian::date`)
- time (`boost::posix_time::time_duration`)
Each formatter class implements one or several getters as well as the `to_string` method.
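As an illustration, a formatter for a little-endian 64-bit floating-point column might look roughly like the sketch below; the class name and the `get_number` getter are hypothetical, not the package's actual API:

```cpp
#include <cstring>
#include <string>

// Hypothetical formatter for a little-endian 64-bit floating-point column.
// The real package selects the exact operator (endianness, 32/64 bits, ...)
// once per column and stores it behind the external-polymorphism model class.
struct LittleEndianDoubleFormatter {
  // Getter: decode a double from the raw column bytes.
  double get_number(const char* _p) const {
    double value;
    std::memcpy(&value, _p, sizeof(value));  // assumes a little-endian host
    return value;
  }
  // Every formatter also exposes to_string.
  std::string to_string(const char* _p) const {
    return std::to_string(get_number(_p));
  }
};
```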
Here is an example of converting a SAS7BDAT file to a CSV file:
```cpp
// See for example apps/cppsas7bdat-ci.cpp
#include <fstream>

#include <cppsas7bdat/reader.hpp>
#include <cppsas7bdat/source/ifstream.hpp>
#include <cppsas7bdat/sink/csv.hpp>

void sas7bdat_to_csv(const char* _filename_sas7bdat,
                     const char* _filename_csv)
{
  std::ofstream csv_os(_filename_csv);
  cppsas7bdat::Reader(cppsas7bdat::datasource::ifstream(_filename_sas7bdat),
                      cppsas7bdat::datasink::csv(csv_os)).read_all();
}
```
It is possible to provide your own data sources and sinks:
```cpp
#include <cppsas7bdat/reader.hpp>

struct MyDataSource {
  MyDataSource(...) { /* ... */ }

  /// This method is called to check if there is any more data
  bool eof() { /* ... */ }
  /// This method is called to read data
  bool read_bytes(void* _p, const size_t _length) { /* ... */ }
};

struct MyDataSink {
  MyDataSink(...) { /* ... */ }

  /// This method is called once the header/metadata is read.
  void set_properties(const cppsas7bdat::Properties& _properties) { /* ... */ }
  /// This method is called for each new row.
  void push_row(const size_t _irow, cppsas7bdat::Column::PBUF _p) { /* ... */ }
  /// This method is called at the end of data
  void end_of_data() { /* ... */ }
};

void read_sas7bdat(...)
{
  cppsas7bdat::Reader reader(MyDataSource(...), MyDataSink(...));

  // Read row by row
  while(reader.read_row());
  // OR read chunk by chunk
  while(reader.read_rows(chunk_size));
  // OR read the whole file
  reader.read_all();
}
```
3 sinks -- `SinkByRow()`, `SinkByChunk(chunk_size)` and `SinkWholeData()` -- are provided by the `pycppsas7bdat` python package. They use a `pandas.DataFrame` to store the data.
```python
from pycppsas7bdat.read_sas import read_sas

sink = read_sas("filename.sas7bdat", include=[...], exclude=[...])
print(sink.properties)
print(sink.df)
```
```python
from pycppsas7bdat import Reader
from pycppsas7bdat.sink import SinkByRow, SinkByChunk, SinkWholeData

sink = SinkByRow()  # or SinkByChunk() or SinkWholeData()
r = Reader("filename.sas7bdat", sink, include=[...], exclude=[...])

# Read row by row
while r.read_row(): pass
# OR read chunk by chunk
while r.read_rows(chunk_size): pass
# OR read the whole file
r.read_all()

# Export to pandas.DataFrame
print(sink.properties)
print(sink.df)
```
It is easy to write your own sinks:
```python
class MySink(object):
    rows = []

    def set_properties(self, properties):  # This method must be defined
        """
        @brief: Called once after reading the header and metadata
        @param properties: A Properties object with the header, metadata and columns definition
        """
        self.columns = [col.name for col in properties.metadata.columns]

    def push_row(self, irow, row):  # This method must be defined
        """
        @brief: Called for every row
        @param irow: Zero-based index of the row
        @param row: A list of values, one for each column.
        """
        self.rows.append(row)
```
```python
class MySinkChunk(object):
    chunks = []
    chunk_size = 10000  # This member must be present for a SinkChunk

    def set_properties(self, properties):  # This method must be defined
        """
        @brief: Called once after reading the header and metadata
        @param properties: A Properties object with the header, metadata and columns definition
        """
        self.columns = [col.name for col in properties.metadata.columns]

    def push_rows(self, istartrow, iendrow, rows):  # This method must be defined
        """
        @brief: Called for every chunk of data read
        @param istartrow: Zero-based index of the start row
        @param iendrow: Zero-based index of the end row (inclusive)
        @param rows: A dict of lists of values. The keys are the columns' names.
        """
        self.chunks.append(rows)
```
The R package provides a function to directly read a sas7bdat file into a data.frame:
```r
require(CPPSAS7BDAT)

sink <- CPPSAS7BDAT::read_sas("path/to/file.sas7bdat", include=c(...), exclude=c(...));
properties <- sink$properties;
df <- sink$df;
```
You can also provide your own sink with an R6 class:
```r
require(CPPSAS7BDAT)
library(R6)

MySink <- R6Class("MySink",
  public=list(
    initialize = function() {
    },
    set_properties = function(properties) {
    },
    push_row = function(irow, row) {
    }
  )
);

MySinkChunk <- R6Class("MySinkChunk",
  public=list(
    chunk_size = NULL,
    initialize = function(chunk_size=10000) {
      self$chunk_size = chunk_size;
    },
    set_properties = function(properties) {
    },
    push_rows = function(istartrow, iendrow, rows) {
    }
  )
);

sink <- MySink$new(); # OR MySinkChunk$new(10000);
r <- CPPSAS7BDAT::sas_reader("path/to/file.sas7bdat", sink, include=c(...), exclude=c(...));
r$read_all(); # OR r$read_row(); OR r$read_rows(chunk_size)
properties <- sink$properties;
df <- sink$df;
```
File | cppsas7bdat -- native ¹ | cppsas7bdat -- python ¹ | cppsas7bdat -- R ² | SASLib.jl ³ | readstat ¹ | pandas ¹ | sas7bdat -- python ¹ |
---|---|---|---|---|---|---|---|
data_AHS2013/topical.sas7bdat ᵃ | 0.080 s | 0.45 s | 0.30 s | 1.1 s | 1.8 s | 11 s | 28 s |
data_misc/numeric_1000000_2.sas7bdat ᵇ | 0.013 s | 0.21 s | 0.02 s | 0.085 s | 1.1 s | 0.9 s | 5.5 s |
¹ Measurements done with hyperfine
² Measurements done with rbenchmark
³ Measurements done with Julia/BenchmarkTools
ᵃ 13M, 84355 rows x 114 cols
ᵇ 16M, 1000000 rows x 2 cols
The unit tests use more than 170 files from different sources with different encodings, compressions and endianness.
Inspired by https://github.com/cpp-best-practices/cpp_starter_project
This installs conan inside your python environment and downloads all the dependencies missing on your system:
```bash
pip install conan
conan install conanfile.py --build=missing
```
Alternatively, you can install the following dependencies manually:
```bash
sudo apt-get install libboost1.71-all-dev
```

```bash
git clone https://github.com/catchorg/Catch2 --branch v2.x
cd Catch2
mkdir build; cd build; cmake ..; make; sudo make install
```

```bash
git clone https://github.com/fmtlib/fmt.git
cd fmt
mkdir build; cd build; cmake ..; make; sudo make install
```

```bash
git clone https://github.com/gabime/spdlog.git
cd spdlog
cmake -S . -B ./build -DSPDLOG_FMT_EXTERNAL=ON; cmake --build ./build; cd build; sudo make install
```

```bash
git clone git@github.com:docopt/docopt.cpp.git
cd docopt.cpp
mkdir build; cd build; cmake ..; make; sudo make install
```

```bash
git clone git@github.com:nlohmann/json.git
cd json
mkdir build; cd build; cmake ..; make; sudo make install
```