-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy pathREADME.Rmd
760 lines (619 loc) · 27.6 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
---
output:
md_document:
variant: markdown_github
includes:
in_header: header.md
toc: true
toc_depth: 4
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(digits = 4)
```
# Setup
## Software Libraries
## Mac OS X
```bash
## To install Homebrew:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
## Then:
brew tap homebrew/science && brew update
# brew install gcc --enable-cxx && brew link --overwrite gcc && brew link cmake
brew install boost --c++11
# Installing cmake may require: sudo chown -R `whoami` /usr/local/share/man/man7
brew install mlpack
```
May need to `brew reinstall --build-from-source mlpack` if you get "Reason: image not found" error about lack of Armadillo library. When I came back to these notes, Armadillo has been updated to v8 from v7 on my system and that was causing issues.
## Ubuntu/Debian
```bash
sudo apt-get install libmlpack-dev libdlib-dev
```
## R Packages
```R
install.packages(c(
"BH", # Header files for 'Boost' C++ library
"Rcpp", # R and C++ integration
"RcppArmadillo", # Rcpp integration for 'Armadillo' linear algebra library
"Rcereal", # header files of 'cereal', a C++11 library for serialization
"microbenchmark", # For benchmarking performance
"devtools", # For installing packages from GitHub
"magrittr", # For piping
"knitr", # For printing tables & data.frames as Markdown
"toOrdinal" # Cardinal to ordinal number conversion function (e.g. 1 => "1st")
), repos = "https://cran.rstudio.com/")
devtools::install_github("yihui/printr") # Prettier table printing
```
If you get "ld: library not found for -lgfortran" error when trying to install RcppArmadillo, run:
```bash
curl -O http://r.research.att.com/libs/gfortran-4.8.2-darwin13.tar.bz2
sudo tar fvxz gfortran-4.8.2-darwin13.tar.bz2 -C /
```
See "[Rcpp, RcppArmadillo and OS X Mavericks "-lgfortran" and "-lquadmath" error](http://thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks-lgfortran-and-lquadmath-error/)"" for more info.
# Rcpp
See [this section](http://rmarkdown.rstudio.com/authoring_knitr_engines.html#rcpp) in [RMarkdown documentation](http://rmarkdown.rstudio.com/) for details on Rcpp chunks.
```{r load_pkg}
library(magrittr)
library(BH)
library(Rcpp)
library(RcppArmadillo)
library(Rcereal)
library(microbenchmark)
library(knitr)
library(printr)
```
## Basics
```{Rcpp x2_source, cache = TRUE}
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector cumSum(NumericVector x) {
int n = x.size();
NumericVector out(n);
out[0] = x[0];
for (int i = 1; i < n; ++i) {
out[i] = out[i-1] + x[i];
}
return out;
}
```
```{r x2_example, dependson = 'x2_source', cache = TRUE}
x <- 1:1000
microbenchmark(
native = cumsum(x),
loop = (function(x) {
output <- numeric(length(x))
output[1] <- x[1]
for (i in 2:length(x)) {
output[i] <- output[i - 1] + x[i]
}
return(output)
})(x),
Rcpp = cumSum(x)
) %>% summary(unit = "ms") %>% knitr::kable(format = "markdown")
```
## Using Libraries
### Armadillo vs RcppArmadillo
Use the **depends** attribute to bring in [RcppArmadillo](https://cran.r-project.org/package=RcppArmadillo), which is an Rcpp integration of the templated linear algebra library [Armadillo](http://arma.sourceforge.net/). The code below is an example of a fast linear model from Dirk Eddelbuettel.
```{Rcpp fastlm_source, cache = TRUE}
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
List fastLm(const colvec& y, const mat& X) {
int n = X.n_rows, k = X.n_cols;
colvec coef = solve(X, y);
colvec resid = y - X*coef;
double sig2 = as_scalar(trans(resid) * resid/(n-k));
colvec stderrest = sqrt(sig2 * diagvec( inv(trans(X)*X)) );
return List::create(_["coefficients"] = coef,
_["stderr"] = stderrest,
_["df.residual"] = n - k );
}
```
```{r fastlm_example, dependson = 'fastlm_source', cache = TRUE}
data("mtcars", package = "datasets")
microbenchmark(
lm = lm(mpg ~ wt + disp + cyl + hp, data = mtcars),
fastLm = fastLm(mtcars$mpg, cbind(1, as.matrix(mtcars[, c("wt", "disp", "cyl", "hp")]))),
RcppArm = fastLmPure(cbind(1, as.matrix(mtcars[, c("wt", "disp", "cyl", "hp")])), mtcars$mpg)
) %>% summary(unit = "ms") %>% knitr::kable(format = "markdown")
```
### Fast K-Means
Unfortunately, [RcppMLPACK](https://cran.r-project.org/package=RcppMLPACK) uses version 1 of [MLPACK](http://www.mlpack.org/) (now in version 2) and only makes the unsupervised learning methods accessible. (Supervised methods would require returning a trained classifier object to R, which is actually a really difficult problem.)
Okay, let's try to get a fast version of <span title = "Bradley, P. S., & Fayyad, U. M. (1998). Refining Initial Points for K-Means Clustering. Icml." style = "font-weight: bold;">k-means</span>.
First, install the MLPACK library (see [§ Software Libraries](#software-libraries)), then:
```{r mlpack11, cache = FALSE}
# Thanks to Kevin Ushey for suggesting Rcpp plugins (e.g. Rcpp:::.plugins$openmp)
registerPlugin("mlpack11", function() {
return(list(env = list(
USE_CXX1X = "yes",
CXX1XSTD = "-std=c++11",
PKG_LIBS = "-lmlpack"
)))
})
```
We refer to the documentation for [KMeans](http://www.mlpack.org/docs/mlpack-1.0.6/doxygen.php?doc=kmtutorial.html#kmeans_kmtut) shows, although it seems to incorrectly use `arma::Col<size_t>` for cluster assigments while in practice the cluster assignments are returned as an `arma::Row<size_t>` object.
```{Rcpp kmeans_source, cache = TRUE}
// [[Rcpp::plugins(mlpack11)]]
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
#include <mlpack/core/util/log.hpp>
#include <mlpack/methods/kmeans/kmeans.hpp>
using namespace mlpack::kmeans;
using namespace arma;
// [[Rcpp::export]]
NumericVector mlpackKM(const arma::mat& data, const size_t& clusters) {
Row<size_t> assignments;
KMeans<> k;
k.Cluster(data, clusters, assignments);
// Let's change the format of the output to be a little nicer:
NumericVector results(data.n_cols);
for (int i = 0; i < assignments.n_cols; i++) {
results[i] = assignments(i) + 1; // cluster assignments are 0-based
}
return results;
}
```
(Alternatively: `sourceCpp("src/fastKm.cpp")` which creates `fastKm()` from [src/fastKm.cpp](src/fastKm.cpp))
Getting:
```
Symbol not found: __ZN6mlpack3Log6AssertEbRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
Expected in: flat namespace
```
Which (via `c++filt __ZN6mlpack3Log6AssertEbRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE`) translates to: "mlpack::Log::Assert(bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)"
```{r kmeans_example, dependson = 'kmeans_source', cache = TRUE}
data(trees, package = "datasets"); data(faithful, package = "datasets")
# kmeans coerces data frames to matrix, so it's worth doing that beforehand
mtrees <- as.matrix(trees)
mfaithful <- as.matrix(faithful)
# KMeans in MLPACK requires observations to be in columns, not rows:
ttrees <- t(trees); tfaithful <- t(faithful)
microbenchmark(
kmeans_trees = kmeans(mtrees, 3),
mlpackKM_trees = mlpackKM(ttrees, 3),
kmeans_faithful = kmeans(mfaithful, 2),
mlpackKM_faithful = mlpackKM(tfaithful, 2)
) %>% summary(unit = "ms") %>% knitr::kable(format = "markdown")
```
## Fast Classification
In this exercise, we will train a [Naive Bayes classifier from MLPACK](http://www.mlpack.org/docs/mlpack-2.1.0/doxygen.php?doc=classmlpack_1_1naive__bayes_1_1NaiveBayesClassifier.html). First, we train and classify in a single step. Then we will store the trained classifier in memory, and then later we will be able to save the model. Storing the trained model requires [serialization](https://en.wikipedia.org/wiki/Serialization), the topic of the next section.
**Update (2017-09-05)**: there is now -- apparently -- a [RcppMLPACK2](https://github.com/rcppmlpack/rcppmlpack2). For more details, refer to [RcppMLPACK2 and the MLPACK Machine Learning Library article](http://gallery.rcpp.org/articles/using-rcppmlpack2/).
```{Rcpp nbc_source, cache = TRUE}
// [[Rcpp::plugins(mlpack11)]]
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
#include <mlpack/core/util/log.hpp>
#include <mlpack/methods/naive_bayes/naive_bayes_classifier.hpp>
using namespace mlpack::naive_bayes;
// [[Rcpp::export]]
NumericVector mlpackNBC(const arma::mat& training_data, const arma::Row<size_t>& labels, const size_t& classes, const arma::mat& new_data) {
// Initialization & training:
NaiveBayesClassifier<> nbc(training_data, labels, classes);
// Prediction:
arma::Row<size_t> predictions;
nbc.Classify(new_data, predictions);
// Let's change the format of the output to be a little nicer:
NumericVector results(predictions.n_cols);
for (int i = 0; i < predictions.n_cols; ++i) {
results[i] = predictions(i);
}
return results;
}
```
For the classification example, we'll use the Iris dataset. (Of course.)
```{r iris, cache = TRUE}
data(iris, package = "datasets")
set.seed(0)
training_idx <- sample.int(nrow(iris), 0.8 * nrow(iris), replace = FALSE)
training_x <- unname(as.matrix(iris[training_idx, 1:4]))
training_y <- unname(iris$Species[training_idx])
testing_x <- unname(as.matrix(iris[-training_idx, 1:4]))
testing_y <- unname(iris$Species[-training_idx])
# For fastNBC:
ttraining_x <- t(training_x)
ttraining_y <- matrix(as.numeric(training_y) - 1, nrow = 1)
classes <- length(levels(training_y))
ttesting_x <- t(testing_x)
ttesting_y <- matrix(as.numeric(testing_y) - 1, nrow = 1)
```
I kept getting "Mat::col(): index out of bounds" error when trying to compile. I debugged the heck out of it until I finally looked in **naive_bayes_classifier_impl.hpp** and saw:
```cpp
for (size_t j = 0; j < data.n_cols; ++j)
{
const size_t label = labels[j];
++probabilities[label];
arma::vec delta = data.col(j) - means.col(label);
means.col(label) += delta / probabilities[label];
variances.col(label) += delta % (data.col(j) - means.col(label));
}
```
Hence why we run into a problem when we use `as.numeric(training_y)` in R and turn that factor into 1s, 2s, and 3s. This makes sense in retrospect but would have been nice to explicitly know that MLPACK expects training data class labels to be 0-based.
```{r nbc_example, dependson = c('nbc_source', 'iris'), cache = TRUE}
# Naive Bayes via e1071
naive_bayes <- e1071::naiveBayes(training_x, training_y)
predictions <- e1071:::predict.naiveBayes(naive_bayes, testing_x, type = "class")
confusion_matrix <- caret::confusionMatrix(
data = predictions,
reference = testing_y
)
confusion_matrix$table
print(confusion_matrix$overall["Accuracy"])
# Naive Bayes via MLPACK
predictions <- mlpackNBC(ttraining_x, ttraining_y, classes, ttesting_x)
confusion_matrix <- caret::confusionMatrix(
data = predictions,
reference = ttesting_y
)
confusion_matrix$table
print(confusion_matrix$overall["Accuracy"])
# Performance Comparison
microbenchmark(
naiveBayes = {
naive_bayes <- e1071::naiveBayes(training_x, training_y)
predictions <- e1071:::predict.naiveBayes(naive_bayes, testing_x, type = "class")
},
fastNBC = mlpackNBC(ttraining_x, ttraining_y, classes, ttesting_x)
) %>% summary(unit = "ms") %>% knitr::kable(format = "markdown")
```
### External Pointers
In the next step, we'll train a Naive Bayes classifier and keep that trained object in memory to make classification a separate step. Notice that we have to:
- declare a pointer: `NaiveBayesClassifier<>* nbc = new NaiveBayesClassifier<>(...)`
- use Rcpp's external pointers (`Rcpp::XPtr`) and
- return an [S-expression](http://adv-r.had.co.nz/C-interface.html#c-data-structures) (`SEXP`).
```{Rcpp nb_train_source, cache = TRUE}
// [[Rcpp::plugins(mlpack11)]]
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
#include <mlpack/core/util/log.hpp>
#include <mlpack/methods/naive_bayes/naive_bayes_classifier.hpp>
using namespace mlpack::naive_bayes;
// [[Rcpp::export]]
SEXP mlpackNBTrainXPtr(const arma::mat& training_data, const arma::Row<size_t>& labels, const size_t& classes) {
// Initialization & training:
NaiveBayesClassifier<>* nbc = new NaiveBayesClassifier<>(training_data, labels, classes);
Rcpp::XPtr<NaiveBayesClassifier<>> p(nbc, true);
return p;
}
```
```{r nbc_example_2a, dependson = 'nbc_example', cache = FALSE}
fit <- mlpackNBTrainXPtr(ttraining_x, ttraining_y, classes)
str(fit)
```
`fit` is an external pointer to some memory. When we pass it to a C++ function, it's passed as an R data type (SEXP) that we have to convert to an external pointer before we can use the object's methods. Notice that we're now calling `nbc->Classify()` instead of `nbc.Classify()`.
```{Rcpp nb_classify_source, dependson = 'nb_train_source', cache = TRUE}
// [[Rcpp::plugins(mlpack11)]]
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
#include <mlpack/core/util/log.hpp>
#include <mlpack/methods/naive_bayes/naive_bayes_classifier.hpp>
using namespace mlpack::naive_bayes;
// [[Rcpp::export]]
NumericVector mlpackNBClassifyXPtr(SEXP xp, const arma::mat& new_data) {
XPtr<NaiveBayesClassifier<>> nbc(xp);
// Prediction:
arma::Row<size_t> predictions;
nbc->Classify(new_data, predictions);
// Let's change the format of the output to be a little nicer:
NumericVector results(predictions.n_cols);
for (int i = 0; i < predictions.n_cols; ++i) {
results[i] = predictions(i);
}
return results;
}
```
```{r nbc_example_2b, dependson = c('nb_classify_source', 'nbc_example_2a'), cache = FALSE}
fit_e1071 <- e1071::naiveBayes(training_x, training_y)
# Performance Comparison
microbenchmark(
`e1071 prediction` = e1071:::predict.naiveBayes(fit_e1071, testing_x, type = "class"),
`MLPACK prediction` = mlpackNBClassifyXPtr(fit, ttesting_x)
) %>% summary(unit = "ms") %>% knitr::kable(format = "markdown")
```
See [Exposing C++ functions and classes with Rcpp modules](http://dirk.eddelbuettel.com/code/rcpp/Rcpp-modules.pdf) for more information.
## Random Number Generation
[Rcpp sugar](http://adv-r.had.co.nz/Rcpp.html#rcpp-sugar) provides a bunch of useful features and high-level abstractions, such as statistical distribution functions. Let's create a function that yields a sample of _n_ independent draws from Bernoulli(*p*):
```{Rcpp rng_source, cache = TRUE}
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector bernie(const int& n, const float& p) {
return rbinom(n, 1, p);
}
```
> Section 6.3 of [Writing R Extensions](http://cran.r-project.org/doc/manuals/r-release/R-exts.html#Random-number-generation) describes an additional requirement for calling the R random number generation functions: you must call GetRNGState prior to using them and then PutRNGState afterwards. These functions (respectively) read .Random.seed and then write it out after use. When using Rcpp attributes (as we do via the // [[Rcpp::export]] annotation on the functions above) it is not necessary to call GetRNGState and PutRNGState because this is done automatically within the wrapper code generated for exported functions. In fact, since these calls don’t nest it is actually an error to call them when within a function exported via Rcpp attributes. ([Random number generation](http://gallery.rcpp.org/articles/random-number-generation/))
Let's just do a quick test to see if setting seed does what it should:
```{r rng_example, dependson = 'rng_source', cache = TRUE}
n <- 100; p <- 0.5; x <- list()
set.seed(0); x[[1]] <- bernie(n, p)
set.seed(0); x[[2]] <- bernie(n, p)
set.seed(42); x[[3]] <- bernie(n, p)
set.seed(42); x[[4]] <- bernie(n, p)
```
```{r rng_results, dependson = 'rng_example', cache = TRUE, results = 'asis', echo = FALSE}
y <- matrix(logical(length(x)^2), nrow = length(x))
for (i in seq_along(x)) {
y[i, ] <- vapply(x, function(y) {
return(identical(x[[i]], y))
}, TRUE)
}
rownames(y) <- colnames(y) <- paste0("Seed: ", c(rep(0, 2), rep(42, 2)), ", ", vapply(1:2, toOrdinal::toOrdinal, ""), " draw")
y[y] <- "All draws match"
y[y == "FALSE"] <- "Draws don't match"
knitr::kable(y, format = "markdown")
```
Let's see how the performance differs between it and an R-equivalent when _n_=1,000:
```{r rng_benchmark, dependson = 'rng_source', cache = TRUE}
rbern <- function(n, p) {
return(rbinom(n, 1, p))
}
microbenchmark(
R = rbern(1e3, 0.49),
Rcpp = bernie(1e3, 0.49)
) %>% summary(unit = "ms") %>% knitr::kable(format = "markdown")
```
## Serialization
[Serialization](https://en.wikipedia.org/wiki/Serialization) is the process of translating data structures and objects into a format that can be stored. Earlier, we trained a Naive Bayes classifier and kept the trained object in memory, returning an external pointer to it, allowing us to classify new observations as long as it is done within the same session.
This one requires: C++11 capability (if compiled supports it, enable via `// [[Rcpp::plugins(cpp11)]]`) and [cereal](http://uscilab.github.io/cereal/) serialization library, available via [Rcereal package](https://cran.r-project.org/package=Rcereal).
Roughly, we're going to create a wrapper for NumericMatrix that is serializable. **Note**: unlike previous section where the Rcpp chunks were complete, this section has multiple Rcpp chunks that are stitched together using the "ref.label" knitr chunk parameter, allowing me to have notes in-between different functions that should be together. See [Combining Chunks](http://rmarkdown.rstudio.com/authoring_knitr_engines.html#combining-chunks) for more details.
```{Rcpp cerealization, ref.label = c('myclass_definition', 'myclass_test', 'myclass_serialization', 'myclass_deserialization', 'cerealizations_source'), cache = TRUE, include = FALSE}
```
```{Rcpp myclass_definition, eval = FALSE}
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(Rcereal)]]
// Enables us to keep serialization method internally:
#include <cereal/access.hpp>
// see http://uscilab.github.io/cereal/serialization_functions.html#non-public-serialization
// Use std::vector and make it serializable:
#include <vector>
#include <cereal/types/vector.hpp>
// see http://uscilab.github.io/cereal/stl_support.html for more info
// Cereal's binary archiving:
#include <sstream>
#include <cereal/archives/binary.hpp>
// see http://uscilab.github.io/cereal/serialization_archives.html
// and http://uscilab.github.io/cereal/quickstart.html for more info
#include <Rcpp.h>
using namespace Rcpp;
class SerializableNumericMatrix
{
private:
int nrows;
int ncols;
std::vector<double> data;
friend class cereal::access;
// This method lets cereal know which data members to serialize
template<class Archive>
void serialize(Archive& ar)
{
ar( data, nrows, ncols ); // serialize things by passing them to the archive
}
public:
SerializableNumericMatrix(){};
SerializableNumericMatrix(NumericMatrix x) {
std::vector<double> y = as<std::vector<double>>(x); // Rcpp::as
data = y;
nrows = x.nrow();
ncols = x.ncol();
};
NumericVector neuemericMatrix() {
// Thanks to Kevin Ushey for the tip to use dim attribute
// http://stackoverflow.com/a/19866956/1091835
NumericVector d = wrap(data);
d.attr("dim") = Dimension(nrows, ncols);
return d;
}
NumericMatrix numericMatrix() {
NumericMatrix d(nrows, ncols, data.begin());
return d;
}
};
```
Let's just test that everything works before adding [de-]serialization capabilities.
```{Rcpp myclass_test, eval = FALSE}
// [[Rcpp::export]]
NumericMatrix testSNM(NumericMatrix x) {
SerializableNumericMatrix snm(x);
return snm.numericMatrix();
}
```
```{r myclass_example, dependson = 'cerealization', cache = TRUE}
x <- matrix(0:9, nrow = 2, ncol = 5)
(y <- testSNM(x))
```
Yay! Okay, for this we are going to use the serialization/deserialization example from [Rcereal README.md](https://github.com/wush978/Rcereal/blob/master/README.md).
```{Rcpp myclass_serialization, eval = FALSE}
// [[Rcpp::export]]
RawVector serializeSNM(NumericMatrix x) {
SerializableNumericMatrix snm(x);
std::stringstream ss;
{
cereal::BinaryOutputArchive oarchive(ss); // Create an output archive
oarchive(snm);
}
ss.seekg(0, ss.end);
RawVector retval(ss.tellg());
ss.seekg(0, ss.beg);
ss.read(reinterpret_cast<char*>(&retval[0]), retval.size());
return retval;
}
```
```{Rcpp myclass_deserialization, eval = FALSE}
//[[Rcpp::export]]
NumericVector deserializeSNM(RawVector src) {
std::stringstream ss;
ss.write(reinterpret_cast<char*>(&src[0]), src.size());
ss.seekg(0, ss.beg);
SerializableNumericMatrix snm;
{
cereal::BinaryInputArchive iarchive(ss);
iarchive(snm);
}
return snm.numericMatrix();
}
```
```{r cerealization_example, dependson = c('myclass_example', 'cerealization'), cache = TRUE}
(raw_vector <- serializeSNM(x))
deserializeSNM(raw_vector)
```
### Armadillo Serialization (Not Working)
Basically, we want to be able to serialize Armadillo matrices because then we can serialize things like the Naive Bayes classifier that rely on `arma::Mat`'s.
```{r}
# Enables us to include files in src/
registerPlugin("local", function() {
return(list(env = list(
PKG_CXXFLAGS = paste0('-I"', getwd(), '/src"')
)))
})
```
```{Rcpp boosted_serialization, ref.label = c('bs_includes', 'bs_boost_serialization_includes', 'bs_RcppArmadilloForward', 'bs_armadillo_include', 'bs_boost_archive_includes', 'bs_mlpack_includes', 'bs_ns', 'bs_test_src', 'bs_serialize_src', 'bs_deserialize_src'), cache = TRUE, include = FALSE}
```
```{Rcpp bs_includes, eval = FALSE}
// [[Rcpp::plugins(local)]]
// [[Rcpp::plugins(mlpack11)]]
// [[Rcpp::depends(BH)]]
// [[Rcpp::depends(RcppArmadillo)]]
```
Include everything we'll need for `serialize()`:
```{Rcpp bs_boost_serialization_includes, eval = FALSE}
#include <boost/serialization/serialization.hpp>
#include <boost/serialization/nvp.hpp>
#include <boost/serialization/array.hpp>
```
In the next chunk we copy and modify the `RcppArmadillo__RcppArmadilloForward__h` definition from **RcppArmadilloForward.h** so as to use our own Mat extra proto and meat, which is really just:
```cpp
//! Add a serialization operator.
template<typename Archive>
void serialize(Archive& ar, const unsigned int version);
#include <RcppArmadillo/Mat_proto.h>
```
The **Mat_extra_bones.hpp** is included for serialization and **Mat_proto.h** is included so RcppArmadillo plays nice.
```{Rcpp bs_RcppArmadilloForward, eval = FALSE}
#ifndef RcppArmadillo__RcppArmadilloForward__h
#define RcppArmadillo__RcppArmadilloForward__h
#include <RcppCommon.h>
#include <Rconfig.h>
#include <RcppArmadilloConfig.h>
// Costom Mat extension that combines MLPACK with RcppArmadillo's Mat extensions:
#define ARMA_EXTRA_MAT_PROTO mat_extra_bones.hpp
#define ARMA_EXTRA_MAT_MEAT mat_extra_meat.hpp
// Everything else the same:
#define ARMA_EXTRA_COL_PROTO RcppArmadillo/Col_proto.h
#define ARMA_EXTRA_COL_MEAT RcppArmadillo/Col_meat.h
#define ARMA_EXTRA_ROW_PROTO RcppArmadillo/Row_proto.h
#define ARMA_EXTRA_ROW_MEAT RcppArmadillo/Row_meat.h
#define ARMA_RNG_ALT RcppArmadillo/Alt_R_RNG.h
#include <armadillo>
/* forward declarations */
namespace Rcpp {
/* support for wrap */
template <typename T> SEXP wrap ( const arma::Mat<T>& ) ;
template <typename T> SEXP wrap ( const arma::Row<T>& ) ;
template <typename T> SEXP wrap ( const arma::Col<T>& ) ;
template <typename T> SEXP wrap ( const arma::field<T>& ) ;
template <typename T> SEXP wrap ( const arma::Cube<T>& ) ;
template <typename T> SEXP wrap ( const arma::subview<T>& ) ;
template <typename T> SEXP wrap ( const arma::SpMat<T>& ) ;
template <typename T1, typename T2, typename glue_type>
SEXP wrap(const arma::Glue<T1, T2, glue_type>& X ) ;
template <typename T1, typename op_type>
SEXP wrap(const arma::Op<T1, op_type>& X ) ;
template <typename T1, typename T2, typename glue_type>
SEXP wrap(const arma::eGlue<T1, T2, glue_type>& X ) ;
template <typename T1, typename op_type>
SEXP wrap(const arma::eOp<T1, op_type>& X ) ;
template <typename T1, typename op_type>
SEXP wrap(const arma::OpCube<T1,op_type>& X ) ;
template <typename T1, typename T2, typename glue_type>
SEXP wrap(const arma::GlueCube<T1,T2,glue_type>& X ) ;
template <typename T1, typename op_type>
SEXP wrap(const arma::eOpCube<T1,op_type>& X ) ;
template <typename T1, typename T2, typename glue_type>
SEXP wrap(const arma::eGlueCube<T1,T2,glue_type>& X ) ;
template<typename out_eT, typename T1, typename op_type>
SEXP wrap( const arma::mtOp<out_eT,T1,op_type>& X ) ;
template<typename out_eT, typename T1, typename T2, typename glue_type>
SEXP wrap( const arma::mtGlue<out_eT,T1,T2,glue_type>& X );
template <typename eT, typename gen_type>
SEXP wrap( const arma::Gen<eT,gen_type>& X) ;
template<typename eT, typename gen_type>
SEXP wrap( const arma::GenCube<eT,gen_type>& X) ;
namespace traits {
/* support for as */
template <typename T> class Exporter< arma::Mat<T> > ;
template <typename T> class Exporter< arma::Row<T> > ;
template <typename T> class Exporter< arma::Col<T> > ;
template <typename T> class Exporter< arma::SpMat<T> > ;
template <typename T> class Exporter< arma::field<T> > ;
// template <typename T> class Exporter< arma::Cube<T> > ;
} // namespace traits
template <typename T> class ConstReferenceInputParameter< arma::Mat<T> > ;
template <typename T> class ReferenceInputParameter< arma::Mat<T> > ;
template <typename T> class ConstInputParameter< arma::Mat<T> > ;
template <typename T> class ConstReferenceInputParameter< arma::Col<T> > ;
template <typename T> class ReferenceInputParameter< arma::Col<T> > ;
template <typename T> class ConstInputParameter< arma::Col<T> > ;
template <typename T> class ConstReferenceInputParameter< arma::Row<T> > ;
template <typename T> class ReferenceInputParameter< arma::Row<T> > ;
template <typename T> class ConstInputParameter< arma::Row<T> > ;
}
#endif
```
Okay, now that that's been defined, let's include **RcppArmadillo.h** which will include **RcppForward.h** but which will (hopefully) *not* redefine our slightly customized `RcppArmadillo__RcppArmadilloForward__h`.
```{Rcpp bs_armadillo_include, eval = FALSE}
#include <RcppArmadillo.h>
```
MLPACK's Naive Bayes Classifier:
```{Rcpp bs_mlpack_includes, eval = FALSE}
#include <mlpack/core/util/log.hpp>
#include <mlpack/methods/naive_bayes/naive_bayes_classifier.hpp>
using namespace mlpack::naive_bayes;
```
Boost's Archives:
```{Rcpp bs_boost_archive_includes, eval = FALSE}
// Include everything we'll need for archiving.
#include <boost/archive/binary_oarchive.hpp>
#include <boost/archive/binary_iarchive.hpp>
#include <sstream>
```
```{Rcpp bs_ns, eval = FALSE}
using namespace Rcpp;
using namespace arma;
```
```{Rcpp bs_test_src, eval = FALSE}
// [[Rcpp::export]]
mat test() {
mat A(3, 6, fill::randu);
return A;
}
```
```{r bs_test_ex}
set.seed(0); (x1 <- test())
set.seed(0); x2 <- test()
identical(x1, x2)
```
```{Rcpp bs_serialize_src, eval = FALSE}
// [[Rcpp::export]]
RawVector test_serialization(Mat<double> m) {
std::stringstream ss;
// save data to archive
{
boost::archive::binary_oarchive oarchive(ss); // create an output archive
oarchive & m; // write class instance to archive
// archive and stream closed when destructors are called
}
ss.seekg(0, ss.end);
RawVector retval(ss.tellg());
ss.seekg(0, ss.beg);
ss.read(reinterpret_cast<char*>(&retval[0]), retval.size());
return retval;
}
```
# References
- Eddelbuettel, D. (2013). Seamless R and C++ Integration with Rcpp. New York, NY: Springer Science & Business Media. http://doi.org/10.1007/978-1-4614-6868-4
- Wickham, H. A. (2014). Advanced R. Chapman and Hall/CRC. http://doi.org/10.1201/b17487