This is a C++17 implementation of a SAS7BDAT file reader. The project also provides Python and R interfaces.
This is a toy project exploring CMake and C++ external polymorphism.
```bash
make conan-install  # Download and install conan inside your python environment
make                # Compile with cmake and conan
make tests          # Invoke ctest
make tests-python   # Build the python package and execute the tests
make tests-R        # Build the R package and execute the tests
make conan-package  # Create the conan package and install it locally
make benchmark      # Execute the benchmark
make clean          # Clean the build directory
```
A SAS7BDAT file is a database storage file created by Statistical Analysis System (SAS) software. It contains a binary encoded dataset used for advanced analytics, business intelligence, data management, predictive analytics, and more. (from fileinfo.com)
Several projects already exist to read SAS7BDAT files. This project is based on the following implementations:
- https://github.com/WizardMac/ReadStat/
- https://github.com/tk3369/SASLib.jl
- https://pypi.org/project/sas7bdat/
This design pattern is very nicely explained in a talk by Klaus Iglberger, Breaking Dependencies: Type Erasure - A Design Analysis (https://www.youtube.com/watch?v=7GIz9SmRgyc).
The idea was presented in September 1998 in a "C++ Report" paper by Chris Cleeland and Douglas C. Schmidt introducing the external polymorphism pattern (https://www.dre.vanderbilt.edu/~schmidt/PDF/C++-EP.pdf).
This pattern allows classes that are not related by inheritance and/or have no virtual methods to be treated polymorphically. It combines C++ language features with patterns like Adapter and Decorator to give the appearance of polymorphic behavior on otherwise unrelated classes.
A functionality `Foo` uses a concept `X` with different methods, i.e. `foo(...)`. The concept `X` can be seen as an interface that a concrete implementation needs to fulfill. The concrete implementation doesn't have to derive from the interface; the link between the two is done via the model class. Please note that this model class is called adapter in the original paper.
One of the main advantages of this pattern is the complete isolation of the concrete implementations: as they are not tied to a base class, they only have to expose a set of methods. See, for example, Sean Parent's talk Inheritance Is the Base Class of Evil.
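To make the mechanics concrete, here is a minimal sketch of the pattern using the `Foo`/`X`/`foo` names from the description above; the class names `XConcept` and `XModel` are illustrative, not the package's actual code:

```cpp
#include <memory>
#include <utility>

// The "concept": an internal interface describing what Foo needs from X.
struct XConcept {
  virtual ~XConcept() = default;
  virtual void foo() const = 0;
};

// The "model" (called adapter in the original paper): wraps any type that
// provides a foo() method, without that type deriving from anything.
template <typename T>
struct XModel final : XConcept {
  explicit XModel(T _t) : t(std::move(_t)) {}
  void foo() const override { t.foo(); }
  T t;
};

// The functionality: stores the wrapped object and uses it polymorphically.
class Foo {
public:
  template <typename T>
  explicit Foo(T _t) : x(std::make_unique<XModel<T>>(std::move(_t))) {}
  void run() const { x->foo(); }

private:
  std::unique_ptr<XConcept> x;
};

// Any unrelated class exposing foo() can now be passed to Foo:
struct MyImpl {
  void foo() const { /* ... */ }
};
// Foo f{MyImpl{}}; f.run();
```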
This pattern is used at different levels within this package:
- data source
- dataset sink
- selection/filtering of the dataset's columns
- dataset's column formatters
A data source based on `std::ifstream` is provided in this package: `cppsas7bdat::datasource::ifstream`.
3 simple dataset sinks are provided in this package. The first one directly prints the content of the file (header and data) to the screen, and the second one is a very basic csv writer (no field protection besides the double quotes, no encoding, ...), as sketched below.
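As a rough illustration of what "field protection with double quotes" means here, the quoting logic amounts to something like the following sketch (hypothetical code, not the package's actual implementation):

```cpp
#include <string>

// Minimal CSV field protection: wrap the field in double quotes and double
// any embedded quote; no character encoding or further escaping is done.
std::string quote_csv_field(const std::string& _field) {
  std::string out{'"'};
  for (const char c : _field) {
    if (c == '"') out += '"';  // escape an embedded quote by doubling it
    out += c;
  }
  out += '"';
  return out;
}
```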
The package provides several filtering options; for instance, specific columns can be included or excluded via the `include` and `exclude` parameters shown in the examples below.
Each column has a specific type and conversion/format operators. External polymorphism is used internally to store the exact operator for each column, taking into account endianness, 32/64-bit layouts, etc.
Supported types:
- string (`std::string`)
- integer (`long`)
- number (`double`)
- datetime (`boost::posix_time::ptime`)
- date (`boost::gregorian::date`)
- time (`boost::posix_time::time_duration`)
Each formatter class implements one or several getters as well as the `to_string` method.
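As an illustration, a formatter for a little-endian 64-bit floating-point column might look roughly like the sketch below; the class name and the `get_number` getter are hypothetical, not the package's actual API:

```cpp
#include <cstring>
#include <string>

// Hypothetical formatter for a little-endian 64-bit floating-point column.
// The real package selects the exact operator (endianness, 32/64 bits, ...)
// once per column and stores it behind the external-polymorphism model class.
struct LittleEndianDoubleFormatter {
  // Getter: decode a double from the raw column bytes.
  double get_number(const char* _p) const {
    double value;
    std::memcpy(&value, _p, sizeof(value));  // assumes a little-endian host
    return value;
  }
  // Every formatter also exposes to_string.
  std::string to_string(const char* _p) const {
    return std::to_string(get_number(_p));
  }
};
```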
Here is an example of converting a SAS7BDAT file to a CSV file:
```cpp
// See for example apps/cppsas7bdat-ci.cpp
#include <fstream>

#include <cppsas7bdat/reader.hpp>
#include <cppsas7bdat/source/ifstream.hpp>
#include <cppsas7bdat/sink/csv.hpp>

void sas7bdat_to_csv(const char* _filename_sas7bdat,
                     const char* _filename_csv)
{
  std::ofstream csv_os(_filename_csv);
  cppsas7bdat::Reader(cppsas7bdat::datasource::ifstream(_filename_sas7bdat),
                      cppsas7bdat::datasink::csv(csv_os)).read_all();
}
```
It is possible to provide your own data sources and sinks:
```cpp
#include <cppsas7bdat/reader.hpp>

struct MyDataSource {
  MyDataSource(...) { /* ... */ }

  /// This method is called to check if there is any more data
  bool eof() { /* ... */ }
  /// This method is called to read data
  bool read_bytes(void* _p, const size_t _length) { /* ... */ }
};

struct MyDataSink {
  MyDataSink(...) { /* ... */ }

  /// This method is called once the header/metadata is read.
  void set_properties(const cppsas7bdat::Properties& _properties) { /* ... */ }
  /// This method is called for each new row.
  void push_row(const size_t _irow, cppsas7bdat::Column::PBUF _p) { /* ... */ }
  /// This method is called at the end of data
  void end_of_data() { /* ... */ }
};

void read_sas7bdat(...)
{
  cppsas7bdat::Reader reader(MyDataSource(...), MyDataSink(...));

  // Read row by row
  while(reader.read_row());
  // OR read chunk by chunk
  while(reader.read_rows(chunk_size));
  // OR read the whole file
  reader.read_all();
}
```
3 sinks -- `SinkByRow()`, `SinkByChunk(chunk_size)` and `SinkWholeData()` -- are provided by the `pycppsas7bdat` python package. They use a `pandas.DataFrame` to store the data.
```python
from pycppsas7bdat.read_sas import read_sas

sink = read_sas("filename.sas7bdat", include=[...], exclude=[...])
print(sink.properties)
print(sink.df)
```
```python
from pycppsas7bdat import Reader
from pycppsas7bdat.sink import SinkByRow, SinkByChunk, SinkWholeData

sink = SinkByRow()  # or SinkByChunk() or SinkWholeData()
r = Reader("filename.sas7bdat", sink, include=[...], exclude=[...])

# Read row by row
while r.read_row(): pass
# OR read chunk by chunk
while r.read_rows(chunk_size): pass
# OR read the whole file
r.read_all()

# Export to pandas.DataFrame
print(sink.properties)
print(sink.df)
```
It is easy to write your own sinks:
```python
class MySink(object):
    rows = []

    def set_properties(self, properties):  # This method must be defined
        """
        @brief: Called once after reading the header and metadata
        @param properties: A Properties object with the header, metadata and columns definition
        """
        self.columns = [col.name for col in properties.metadata.columns]

    def push_row(self, irow, row):  # This method must be defined
        """
        @brief: Called for every row
        @param irow: Zero-based index of the row
        @param row: A list of values, one for each column.
        """
        self.rows.append(row)
```
```python
class MySinkChunk(object):
    chunks = []
    chunk_size = 10000  # This member must be present for a SinkChunk

    def set_properties(self, properties):  # This method must be defined
        """
        @brief: Called once after reading the header and metadata
        @param properties: A Properties object with the header, metadata and columns definition
        """
        self.columns = [col.name for col in properties.metadata.columns]

    def push_rows(self, istartrow, iendrow, rows):  # This method must be defined
        """
        @brief: Called for every chunk of data read
        @param istartrow: Zero-based index of the start row
        @param iendrow: Zero-based index of the end row (inclusive)
        @param rows: A dict of lists of values. The keys are the columns' names.
        """
        self.chunks.append(rows)
```
The R package provides a function to directly read a sas7bdat file into a data.frame:
```r
require(CPPSAS7BDAT)

sink <- CPPSAS7BDAT::read_sas("path/to/file.sas7bdat", include=c(...), exclude=c(...));
properties <- sink$properties;
df <- sink$df;
```
You can also provide your own sink with an R6 class:
```r
require(CPPSAS7BDAT)
library(R6)

MySink <- R6Class("MySink",
  public=list(
    initialize = function() {
    },
    set_properties = function(properties) {
    },
    push_row = function(irow, row) {
    }
  )
);

MySinkChunk <- R6Class("MySinkChunk",
  public=list(
    chunk_size = NULL,
    initialize = function(chunk_size=10000) {
      self$chunk_size = chunk_size;
    },
    set_properties = function(properties) {
    },
    push_rows = function(istartrow, iendrow, rows) {
    }
  )
);

sink <- MySink$new(); # OR MySinkChunk$new(10000);
r <- CPPSAS7BDAT::sas_reader("path/to/file.sas7bdat", sink, include=c(...), exclude=c(...));
r$read_all(); # OR r$read_row(); OR r$read_rows(chunk_size)
properties <- sink$properties;
df <- sink$df;
```
File | cppsas7bdat -- native ¹ | cppsas7bdat -- python ¹ | cppsas7bdat -- R ² | SASLib.jl ³ | readstat ¹ | pandas ¹ | sas7bdat -- python ¹ |
---|---|---|---|---|---|---|---|
data_AHS2013/topical.sas7bdat ᵃ | 0.080 s | 0.45 s | 0.30 s | 1.1 s | 1.8 s | 11 s | 28 s |
data_misc/numeric_1000000_2.sas7bdat ᵇ | 0.013 s | 0.21 s | 0.02 s | 0.085 s | 1.1 s | 0.9 s | 5.5 s |
¹ Measurements done with hyperfine
² Measurements done with rbenchmark
³ Measurements done with Julia/BenchmarkTools
ᵃ 13M, 84355 rows x 114 cols
ᵇ 16M, 1000000 rows x 2 cols
The unit tests use more than 170 files from different sources with different encodings, compressions and endianness.
Inspired by https://github.com/cpp-best-practices/cpp_starter_project
This installs conan inside your python environment and downloads all the dependencies missing on your system:
```bash
pip install conan
conan install conanfile.py --build=missing
```
Alternatively, you can install the following dependencies manually:
```bash
sudo apt-get install libboost1.71-all-dev
```

```bash
git clone https://github.com/catchorg/Catch2 --branch v2.x
cd Catch2
mkdir build; cd build; cmake ..; make; sudo make install
```

```bash
git clone https://github.com/fmtlib/fmt.git
cd fmt
mkdir build; cd build; cmake ..; make; sudo make install
```

```bash
git clone https://github.com/gabime/spdlog.git
cd spdlog
cmake -S . -B ./build -DSPDLOG_FMT_EXTERNAL=ON; cmake --build ./build; cd build; sudo make install
```

```bash
git clone git@github.com:docopt/docopt.cpp.git
cd docopt.cpp
mkdir build; cd build; cmake ..; make; sudo make install
```

```bash
git clone git@github.com:nlohmann/json.git
cd json
mkdir build; cd build; cmake ..; make; sudo make install
```