From 20762ebea5bcb8028261d493d3da2d5e45a5da04 Mon Sep 17 00:00:00 2001
From: Danielle Navarro
Date: Sat, 16 Jul 2022 21:32:20 +1000
Subject: [PATCH] adds plausible.io

---
 _quarto.yml                                   |   1 +
 _site/advanced.html                           |  14 ++-
 _site/data-storage.html                       |  54 ++++-----
 _site/data-wrangling.html                     |  30 +++--
 _site/hello-arrow.html                        |  30 ++---
 _site/index.html                              |  14 ++-
 _site/packages-and-data.html                  |  14 ++-
 _site/search.json                             |  22 ++--
 _site/site_libs/bootstrap/bootstrap.min.css   |   8 +-
 _site/site_libs/quarto-html/quarto.js         |  68 +++++++++--
 _site/site_libs/quarto-nav/quarto-nav.js      |  25 ++--
 .../site_libs/revealjs/dist/theme/quarto.css  |   6 +-
 _site/sitemap.xml                             |  14 +--
 _site/slides.html                             | 108 +++++++++---------
 plausible.html                                |   1 +
 15 files changed, 238 insertions(+), 171 deletions(-)
 create mode 100644 plausible.html

diff --git a/_quarto.yml b/_quarto.yml
index 800ccc0..232d82b 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -35,6 +35,7 @@ format:
 theme: cosmo
 css: styles.css
 toc: true
+ include-after-body: plausible.html

diff --git a/_site/advanced.html b/_site/advanced.html
index b48131e..bf530c3 100644
--- a/_site/advanced.html
+++ b/_site/advanced.html
@@ -2,7 +2,7 @@
- +
@@ -80,19 +80,20 @@
 code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
+ - + - +
@@ -119,8 +120,10 @@
+ +
@@ -205,8 +208,6 @@

Part 4: Advanced Arrow

- -
@@ -566,7 +567,8 @@

The big picture

- + - + - + @@ -119,8 +120,10 @@ + + @@ -202,8 +205,6 @@

Part 3: Data Storage

- -
@@ -314,7 +315,7 @@

Parquet files

invisible() # suppress printing toc()
-
0.484 sec elapsed
+
1.206 sec elapsed
tic()
 parquet_file |>
@@ -322,11 +323,11 @@ 

Parquet files

invisible() toc()
-
0.108 sec elapsed
+
0.183 sec elapsed

This property is handy when dealing with larger-than-memory data: because we can’t load the whole thing into memory, we’re going to have to iteratively read small pieces of the data set. In the next section we’ll talk about how large data sets are typically distributed over many parquet files, but the key thing right now is that whenever we’re loading one of those pieces from a parquet file, an intelligently designed reader will be able to speed things up by reading only the relevant subset of each parquet file.

-
+
@@ -478,7 +479,7 @@

Multi-file data sets< invisible() toc()

-
0.012 sec elapsed
+
0.014 sec elapsed
tic()
 nyc_taxi |> 
@@ -487,7 +488,7 @@ 

Multi-file data sets< invisible() toc()

-
2.331 sec elapsed
+
3.895 sec elapsed

Admittedly, this is a bit of a contrived example, but the core point is still important: partitioning the data set on variables that you’re most likely to query on tends to speed things up.

@@ -582,14 +583,14 @@

An example

# A tibble: 12 × 2
    month  distance
    <int>     <dbl>
- 1    12 13642500.
- 2     1 33436823.
+ 1     1 33436823.
+ 2    12 13642500.
  3    10 41799496.
  4    11 13826243.
- 5     3 55384892.
- 6     2 40006137.
- 7     5 52798627.
- 8     4 27440575.
+ 5     2 40006137.
+ 6     3 55384892.
+ 7     4 27440575.
+ 8     5 52798627.
  9     6 15617981.
 10     7 19210103.
 11     8 22581320.
@@ -599,17 +600,17 @@ 

An example

Here’s the time taken for this query:

-
0.484 sec elapsed
+
0.556 sec elapsed

and for the same query performed on the nyc_taxi_2016a data:

-
1.783 sec elapsed
+
4.829 sec elapsed

The difference is not quite as extreme as the contrived example earlier, but it’s still quite substantial: using your domain expertise to choose relevant variables to partition on can make a real difference in how your queries perform!

-
+
@@ -649,12 +650,12 @@

An example

# A tibble: 6 × 3
   pickup_datetime     monthday yearday
   <dttm>                 <int>   <int>
-1 2019-11-01 15:10:52        1     305
-2 2019-11-01 15:03:26        1     305
-3 2019-11-01 15:10:34        1     305
-4 2019-11-01 15:10:34        1     305
-5 2019-11-01 15:14:44        1     305
-6 2019-11-01 15:23:41        1     305
+1 2019-10-02 06:41:22        1     274
+2 2019-10-02 06:53:46        1     274
+3 2019-10-02 06:05:22        1     274
+4 2019-10-02 06:19:59        1     274
+5 2019-10-02 06:45:45        1     274
+6 2019-10-02 06:03:44        1     274
@@ -751,7 +752,8 @@

An example

- + - + - + @@ -119,8 +120,10 @@ + + @@ -219,8 +222,6 @@

Part 2: Data Wrangling with Arrow

- -
@@ -461,7 +462,7 @@

Example 3: S

These changes aren’t arbitrary. Translations are never perfect, and you can see hints of that in this example. The exercises below explore this!

-
+
@@ -793,7 +794,9 @@

Example 6: collect()

Error in `collect()`:
-! Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(pickup_location_id) of type int64 and FieldRef.Name(pickup_location_id) of type int32
+! Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(pickup_location_id) of type int64 and FieldRef.Name(pickup_location_id) of type int32
+/home/danielle/GitHub/projects/arrow/cpp/src/arrow/compute/exec/hash_join_node.cc:122 ValidateSchemas(join_type, left_schema, left_keys, left_output, right_schema, right_keys, right_output, left_field_name_suffix, right_field_name_suffix)
+/home/danielle/GitHub/projects/arrow/cpp/src/arrow/compute/exec/hash_join_node.cc:697 schema_mgr->Init( join_options.join_type, left_schema, join_options.left_keys, join_options.left_output, right_schema, join_options.right_keys, join_options.right_output, join_options.filter, join_options.output_suffix_for_left, join_options.output_suffix_for_right)

If we dig into this error message it’s telling us that we’ve encountered a type mismatch. The nyc_taxi and pickup tables both contain a column named pickup_location_id, and so the left_join() function has attempted to join on that column. However, the two columns don’t store the same kind of data, so Arrow throws an error message. That might seem a little odd because they’re both storing integer values, and that’s the same type of data right?

@@ -916,17 +919,19 @@

Exampl # … with 40 more rows

+ + -
+
@@ -1037,8 +1042,8 @@

Example 7: Wind filter(is.na(sex)) |> select(id, sex, species, island)

-
# Source:   lazy query [?? x 4]
-# Database: DuckDB 0.3.4 [danielle@Linux 5.13.0-51-generic:R 4.2.0/:memory:]
+
# Source:   SQL [?? x 4]
+# Database: DuckDB 0.3.5-dev1410 [danielle@Linux 5.13.0-51-generic:R 4.2.0/:memory:]
       id sex   species island   
    <dbl> <chr> <chr>   <chr>    
  1     4 <NA>  Adelie  Torgersen
@@ -1091,7 +1096,7 @@ 

Example 8: Some collect() toc()

-
0.543 sec elapsed
+
0.438 sec elapsed
numerology
@@ -1130,7 +1135,8 @@

Example 8: Some - + - + - + @@ -119,8 +120,10 @@ + + @@ -211,8 +214,6 @@

Part 1: Hello Arrow

- -
@@ -359,7 +360,7 @@

Let’s get started!

Try typing this out yourself and then have a go at the exercises!

-
+
@@ -388,14 +389,14 @@

Let’s get started!

# A tibble: 12 × 2
    month       n
    <int>   <int>
- 1    11 6877463
- 2    10 7213588
- 3    12 6895933
- 4     1 7667255
- 5     4 7432826
+ 1     1 7667255
+ 2    11 6877463
+ 3    10 7213588
+ 4    12 6895933
+ 5     2 7018750
  6     3 7832035
- 7     5 7564884
- 8     2 7018750
+ 7     4 7432826
+ 8     5 7564884
  9     6 6940489
 10     7 6310134
 11     8 6072851
@@ -555,7 +556,8 @@ 

Data manipulation

- + - + - + @@ -119,8 +120,10 @@ + + @@ -204,8 +207,6 @@

Larger-Than-Memory Data Workflows with Apache Arrow

- -
@@ -306,7 +307,8 @@

When/Where

- + - + - + @@ -119,8 +120,10 @@ + + @@ -212,8 +215,6 @@

Packages and Data

- -
@@ -493,7 +494,8 @@

Extras!

- + + + + + + + + Apache Arrow in R - Larger-Than-Memory Data Workflows with Apache Arrow
@@ -47,48 +53,42 @@
 }
 pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa; padding-left: 4px; }
 div.sourceCode
- { }
+ { color: #003b4f; background-color: #f1f3f5; }
 @media screen {
 pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
 }
- code span.al { color: #ff0000; font-weight: bold; } /* Alert */
- code span.an { color: #60a0b0; font-weight: bold; font-style: italic; } /* Annotation */
- code span.at { color: #7d9029; } /* Attribute */
- code span.bn { color: #40a070; } /* BaseN */
+ code span { color: #003b4f; } /* Normal */
+ code span.al { color: #ad0000; } /* Alert */
+ code span.an { color: #5e5e5e; } /* Annotation */
+ code span.at { color: #657422; } /* Attribute */
+ code span.bn { color: #ad0000; } /* BaseN */
 code span.bu { } /* BuiltIn */
- code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
- code span.ch { color: #4070a0; } /* Char */
- code span.cn { color: #880000; } /* Constant */
- code span.co { color: #60a0b0; font-style: italic; } /* Comment */
- code span.cv { color: #60a0b0; font-weight: bold; font-style: italic; } /* CommentVar */
- code span.do { color: #ba2121; font-style: italic; } /* Documentation */
- code span.dt { color: #902000; } /* DataType */
- code span.dv { color: #40a070; } /* DecVal */
- code span.er { color: #ff0000; font-weight: bold; } /* Error */
+ code span.cf { color: #003b4f; } /* ControlFlow */
+ code span.ch { color: #20794d; } /* Char */
+ code span.cn { color: #8f5902; } /* Constant */
+ code span.co { color: #5e5e5e; } /* Comment */
+ code span.cv { color: #5e5e5e; font-style: italic; } /* CommentVar */
+ code span.do { color: #5e5e5e; font-style: italic; } /* Documentation */
+ code span.dt { color: #ad0000; } /* DataType */
+ code span.dv { color: #ad0000; } /* DecVal */
+ code span.er { color: #ad0000; } /* Error */
 code span.ex { } /* Extension */
- code span.fl { color: #40a070; } /* Float */
- code span.fu { color: #06287e; } /* Function */
- code span.im { } /* Import */
- code span.in { color: #60a0b0; font-weight: bold; font-style: italic; } /* Information */
- code span.kw { color: #007020; font-weight: bold; } /* Keyword */
- code span.op { color: #666666; } /* Operator */
- code span.ot { color: #007020; } /* Other */
- code span.pp { color: #bc7a00; } /* Preprocessor */
- code span.sc { color: #4070a0; } /* SpecialChar */
- code span.ss { color: #bb6688; } /* SpecialString */
- code span.st { color: #4070a0; } /* String */
- code span.va { color: #19177c; } /* Variable */
- code span.vs { color: #4070a0; } /* VerbatimString */
- code span.wa { color: #60a0b0; font-weight: bold; font-style: italic; } /* Warning */
+ code span.fl { color: #ad0000; } /* Float */
+ code span.fu { color: #4758ab; } /* Function */
+ code span.im { color: #00769e; } /* Import */
+ code span.in { color: #5e5e5e; } /* Information */
+ code span.kw { color: #003b4f; } /* Keyword */
+ code span.op { color: #5e5e5e; } /* Operator */
+ code span.ot { color: #003b4f; } /* Other */
+ code span.pp { color: #ad0000; } /* Preprocessor */
+ code span.sc { color: #5e5e5e; } /* SpecialChar */
+ code span.ss { color: #20794d; } /* SpecialString */
+ code span.st { color: #20794d; } /* String */
+ code span.va { color: #111111; } /* Variable */
+ code span.vs { color: #20794d; } /* VerbatimString */
+ code span.wa { color: #5e5e5e; font-style: italic; } /* Warning */
- - - - - - -
@@ -102,7 +102,7 @@
 }
 .callout.callout-style-simple {
- padding: 0em 0.7em;
+ padding: 0em 0.5em;
 border-left: solid #acacac .3rem;
 border-right: solid 1px silver;
 border-top: solid 1px silver;
@@ -213,8 +213,7 @@
 display: none !important;
 }
- .callout.callout-captioned .callout-body > :last-child,
- .callout.callout-captioned .callout-body > div > :last-child {
+ .callout.callout-captioned .callout-body > .callout-content > :last-child {
 margin-bottom: 0.5rem;
 }
@@ -1069,7 +1068,9 @@

Why didn’t this work?

collect()
Error in `collect()`:
-! Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(pickup_location_id) of type int64 and FieldRef.Name(pickup_location_id) of type int32
+! Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(pickup_location_id) of type int64 and FieldRef.Name(pickup_location_id) of type int32
+/home/danielle/GitHub/projects/arrow/cpp/src/arrow/compute/exec/hash_join_node.cc:122 ValidateSchemas(join_type, left_schema, left_keys, left_output, right_schema, right_keys, right_output, left_field_name_suffix, right_field_name_suffix)
+/home/danielle/GitHub/projects/arrow/cpp/src/arrow/compute/exec/hash_join_node.cc:697 schema_mgr->Init( join_options.join_type, left_schema, join_options.left_keys, join_options.left_output, right_schema, join_options.right_keys, join_options.right_output, join_options.filter, join_options.output_suffix_for_left, join_options.output_suffix_for_right)
@@ -1262,8 +1263,8 @@

An easy fix with {duckdb}

filter(is.na(sex)) |> select(id, sex, species, island)
-
# Source:   lazy query [?? x 4]
-# Database: DuckDB 0.3.4 [danielle@Linux 5.13.0-51-generic:R 4.2.0/:memory:]
+
# Source:   SQL [?? x 4]
+# Database: DuckDB 0.3.5-dev1410 [danielle@Linux 5.13.0-51-generic:R 4.2.0/:memory:]
       id sex   species island   
    <dbl> <chr> <chr>   <chr>    
  1     4 <NA>  Adelie  Torgersen
@@ -1424,7 +1425,7 @@ 

Selective reads are faster

invisible() # suppress printing toc()
-
0.528 sec elapsed
+
1.155 sec elapsed


@@ -1435,7 +1436,7 @@

Selective reads are faster

invisible() toc()
-
0.143 sec elapsed
+
0.191 sec elapsed
@@ -1626,7 +1627,7 @@

Partition structure matters

toc()
-
0.016 sec elapsed
+
0.013 sec elapsed
@@ -1642,7 +1643,7 @@

Partition structure matters

toc()
-
2.952 sec elapsed
+
4.958 sec elapsed
@@ -1992,12 +1993,13 @@

{arrow} brings them together

- - - - - + + + + + + @@ -2238,7 +2240,7 @@

{arrow} brings them together

})(); -