forked from bwlewis/duckdb_and_r
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy paththoughts_on_duckdb.rmd
108 lines (84 loc) · 5.68 KB
/
thoughts_on_duckdb.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
# DuckDB and R: Five Easy Pieces
DuckDB (https://duckdb.org) is a nifty new SQL database library, for use
embedded in other programs, in the spirit of other embed-able databases like
SQLite, Berkeley DB, and LMDB, among many others. Unlike SQLite, DuckDB uses a
column-oriented data layout (like R data frames) and has a particular focus on
performance for interactive data analysis and superb integration with popular
data analysis systems like Python and R.
DuckDB's integration with R and Python is so complete that it can work directly
on in-memory language objects without copying. This lets you use DuckDB
directly on Pandas and R data frames, including excellent support for dplyr.
These notes present 5 easy examples of database-like tasks in R that illustrate
stylistic and performance across different R-centric and SQL approaches to
representative problems. The examples are short and range from basic data
processing that SQL excels at to somewhat more complex, but common, specialized
workflows from genomics and finance.
# The easy pieces
* [Large out of core data](https://bwlewis.github.io/duckdb_and_r/taxi/taxi.html)
* [TPCH join-aggregate example from DuckDB](https://bwlewis.github.io/duckdb_and_r/tpch/tpch.html)
* [Last item per group](https://bwlewis.github.io/duckdb_and_r/last/last.html)
* [Genomic overlap joins](https://bwlewis.github.io/duckdb_and_r/ranges/ranges.html)
* ["As of" joins](https://bwlewis.github.io/duckdb_and_r/asof/asof.html)
## *My* opinion
DuckDB is a SQL engine with superb R integration. It exhibits reasonably fast
joins and aggregations, typically among the most important features of a
database system. It is simple to use, loads data robustly, and runs very
fast on larger-than-RAM data (DuckDB's sweet spot, IMO).
Personally, I have, at best, mixed feelings about SQL. On the positive side
lies the ability of good database engines to optimize arbitrary queries for
performance. And, SQL's declarative style and relational model can indeed make
many data wrangling intentions easy to express, at least for simple data
manipulation tasks.
But, many problems linger with SQL implementations. Query optimizers, it turns
out, can only do so much. Worse, there exists substantial syntax variation and
idiosyncrasies across various SQL implementations. For even slightly complex
tasks (see the [last item per group](https://bwlewis.github.io/duckdb_and_r/last/last.html)
or [as of join](https://bwlewis.github.io/duckdb_and_r/asof/asof.html) examples!) SQL often becomes difficult to
write and understand (for me, anyway). I find SQL awkward to compose and nest.
It is hard to do things that should be really simple like mixing
result types in a query (scalars and vectors, say). And there are obvious
downsides of representing SQL programs as big character strings in R--for
instance, all errors can only be caught at run time.
Chamberlin (https://dl.acm.org/doi/10.1109/MAHC.2012.61) memorably described an
"impedance" mismatch between more imperative/procedural programming languages
and the ostensibly more declarative SQL style. I am not entirely convinced by that,
seeing that both SQL and R exhibit each style and both are either functional or
more generally relational in spirit. See this amusing rant for
an example of that:
<a href="https://bwlewis.github.io/duckdb_and_r/last/declarative.html">Declarative, Schmerative!</a>.
<b>Fortunately, there is dplyr (https://dplyr.tidyverse.org/)!</b> Dplyr calls
itself a grammar of data manipulation, but it is also an impedance matching
circuit that lets us write lovely composable R functions and still gain the
benefits of query optimization and performance that the database implementation
has to offer.
Finally, for problems that fit in your system's memory, I don't find any
performance advantage of using DuckDB over alternative R approaches. In
particular data.table (https://github.com/Rdatatable/data.table) is usually
(always?) faster than DuckDB for in-memory problems. Humble base R also often
gets close to or exceeds DuckDB performance. Plus, base R and data.table offer a
substantial array of specialized data manipulation operations that easily
exceed the performance and capabilities of DuckDB, as shown in some of the
pieces. However, DuckDB works without change on problems that are larger than
the computer memory residing in files which can be an important advantage in
some cases.
For me, DuckDB is best used with R using dplyr, a really nice combination of
technologies. In summary:
* Use DuckDB for larger-than-RAM data manipulation.
* Use DuckDB with dplyr, avoid SQL except for basic stuff.
But remember, this is just, like, my opinion, man. I know that many folks are
very comfortable with SQL, and I agree that for many data manipulation tasks
SQL can be quite nice. For those tasks, DuckDB provides a reasonably
high-performance and elegantly integrated solution.
# Notes on performance testing
The examples are all fairly small and short, so take performance measurements
with a grain of salt. Unless otherwise indicated, tests were performed on my
cheapish ($700) Acer Swift 3 laptop with 8GB LPDDR4 3733 MT/s RAM, 8 physical
AMD Ryzen 7 4700U 2GHz CPU cores with a 512GB Samsung NVME SSD drive running
Devuan Beowulf (stable) GNU/Linux with a Linux 5.10 kernel.
Benchmarking, especially on laptops, is very hard due to CPU frequency scaling
for thermal management. For these tests I set the CPU frequency scaling
governor to "performance" and also repeated each test many times to minimize
that problem. By the way, despite a mediocre display, that Acer laptop gives
*outstanding* CPU performance value!
The examples use R version 4.0.2, DuckDB R package version 0.2.6, dplyr R
package version 1.0.4, and the data.table R package version 1.14.0.