Lecture 18
| Task | Timing (ns) | Timing (μs) | 
|---|---|---|
| L1 cache reference | 0.5 | 0.0005 | 
| L2 cache reference | 7 | 0.007 | 
| Main memory reference | 100 | 0.1 | 
| Random seek SSD | 150,000 | 150 | 
| Read 1 MB sequentially from memory | 250,000 | 250 | 
| Read 1 MB sequentially from SSD | 1,000,000 | 1,000 | 
| Disk seek | 10,000,000 | 10,000 | 
| Read 1 MB sequentially from disk | 20,000,000 | 20,000 | 
| Send packet CA->Netherlands->CA | 150,000,000 | 150,000 | 
Lets imagine we have a 10 GB flat data file and that we want to select certain rows based on a particular criteria. This requires a sequential read across the entire data set.
| File Location | Performance | Time | 
|---|---|---|
| in memory | \(10~GB \times (250~\mu s / 1~MB)\) | 2.5 seconds | 
| on disk (SSD) | \(10~GB \times (1~ms / 1~MB)\) | 10 seconds | 
| on disk (HD) | \(10~GB \times (20~ms / 1~MB)\) | 200 seconds | 
This is just for reading sequential data, if we make any modifications (writing) or the data is fragmented things are much worse.
Disk << SSD <<< Memory
Disk <<< SSD << Memory
So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and if we can’t fit everything into memory?
Create blocks - group related data (i.e. rows) and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.
Even with blocks, any kind of querying / subsetting of rows requires a linear search, which requires \(\mathcal{O}(N)\) reads.
We can do better if we are careful about how we structure our data, specifically sorting’ some (or all) of the columns.
Sorting is expensive, \(\mathcal{O}(N \log N)\), but it only needs to be done once.
After sorting, we can use a binary search for any subsetting tasks, \(\mathcal{O}(\log N)\)
In a databases these “sorted” columns are refered to as indexes.
Indexes require additional storage, but usually small enough to be kept in memory even if blocks need to stay on disk.
This is just barely scratching the surface,
Efficiency gains are not just for disk, access is access
In general, trade off between storage and efficiency
Reality is a lot more complicated for everything mentioned so far, lots of very smart people have spent a lot of time thinking about and implementing tools
Different tasks with different requirements require different implementations and have different criteria for optimization
Low level package for interfacing R with Database management systems (DBMS) that provides a common interface to achieve the following functionality:
DBI is a specification, not an implementation, and there are a number of packages that implement the DBI specification for different database systems. Backends for R-DBI lists all available backends, but some notable ones include:
RPostgres
RMariaDB
RSQLite
odbc
bigrquery
duckdb
sparklyr
Provides the implementation necessary to use DBI to interface with an SQLite database.
this package also loads the necessary DBI functions as well (via re-exporting).
Once loaded we can create a connection to our database,
Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
  ..@ ptr                :<externalptr> 
  ..@ dbname             : chr ":memory:"
  ..@ loadable.extensions: logi TRUE
  ..@ flags              : int 70
  ..@ vfs                : chr ""
  ..@ ref                :<environment: 0x1156a6470> 
  ..@ bigint             : chr "integer64"
  ..@ extended_types     : logi FALSEemployees = tibble(
  name   = c("Alice","Bob","Carol","Dave","Eve","Frank"),
  email  = c("alice@company.com", "bob@company.com",
             "carol@company.com", "dave@company.com",
             "eve@company.com",   "frank@comany.com"),
  salary = c(52000, 40000, 30000, 33000, 44000, 37000),
  dept   = c("Accounting", "Accounting","Sales",
             "Accounting","Sales","Sales"),
)
Databases queries are transactional (see ACID) and are broken up into 3 steps:
There is also dbGetQuery() which combines all three steps,
   name             email salary       dept
1 Alice alice@company.com  52000 Accounting
2   Bob   bob@company.com  40000 Accounting
3 Carol carol@company.com  30000      Sales
4  Dave  dave@company.com  33000 Accounting
5   Eve   eve@company.com  44000      Sales
6 Frank  frank@comany.com  37000      SalesdbCreateTable() will create a new table with a schema based on an existing data.frame / tibble, but it does not populate that table with data.
Data can be added to an existing table via dbAppendTable().
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows
# Source:   table<`flights`> [?? x 19]
# Database: sqlite 3.50.4 [flights.sqlite]
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# ℹ more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dbl>All of this data now lives in the database on the filesystem not in memory,
flight_tbl?List of 2
 $ src       :List of 2
  ..$ con  :Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
  .. .. ..@ ptr                :<externalptr> 
  .. .. ..@ dbname             : chr "flights.sqlite"
  .. .. ..@ loadable.extensions: logi TRUE
  .. .. ..@ flags              : int 70
  .. .. ..@ vfs                : chr ""
  .. .. ..@ ref                :<environment: 0x116b81d98> 
  .. .. ..@ bigint             : chr "integer64"
  .. .. ..@ extended_types     : logi FALSE
  ..$ disco: NULL
  ..- attr(*, "class")= chr [1:4] "src_SQLiteConnection" "src_dbi" "src_sql" "src"
 $ lazy_query:List of 5
  ..$ x         : 'dbplyr_table_path' chr "`flights`"
  ..$ vars      : chr [1:19] "year" "month" "day" "dep_time" ...
  ..$ group_vars: chr(0) 
  ..$ order_vars: NULL
  ..$ frame     : NULL
  ..- attr(*, "class")= chr [1:3] "lazy_base_remote_query" "lazy_base_query" "lazy_query"
 - attr(*, "class")= chr [1:5] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...# Source:   table<`flights`> [?? x 19]
# Database: sqlite 3.50.4 [flights.sqlite]
    year month   day dep_time sched_dep_time dep_delay arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>
 1  2013     1     1      517            515         2      830
 2  2013     1     1      533            529         4      850
 3  2013     1     1      542            540         2      923
 4  2013     1     1      544            545        -1     1004
 5  2013     1     1      554            600        -6      812
 6  2013     1     1      554            558        -4      740
 7  2013     1     1      555            600        -5      913
 8  2013     1     1      557            600        -3      709
 9  2013     1     1      557            600        -3      838
10  2013     1     1      558            600        -2      753
# ℹ more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dbl># Source:   SQL [?? x 3]
# Database: sqlite 3.50.4 [flights.sqlite]
   origin dest  tailnum
   <chr>  <chr> <chr>  
 1 EWR    CLT   N152UW 
 2 EWR    IAH   N535UA 
 3 JFK    MIA   N5BSAA 
 4 JFK    SJU   N531JB 
 5 JFK    BQN   N827JB 
 6 LGA    IAH   N15710 
 7 JFK    IAD   N825AS 
 8 EWR    TPA   N802UA 
 9 LGA    ATL   N996DL 
10 JFK    FLL   N627JB 
# ℹ more rowsdplyr / dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.
When building a query, we don’t want the entire table, often we want just enough to check if our query is working / makes sense.
Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.
Therefore, by default dplyr
won’t connect and query the database until absolutely necessary (e.g. show output),
and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like.
we can force evaluation via compute(), collect(), or collapse()
show_query() - dplyr to SQLIn general, dplyr / dbplyr knows how to translate basic math, logical, and summary functions from R to SQL. dbplyr has a function, translate_sql(), that lets you experiment with how R functions are translated to SQL.
Error in `sd()`:
! `sd()` is not available in this SQL variant.<SQL> CONCAT_WS(' ', `x`, `y`)Warning: Windowed expression `SUM(`x`)` does not have explicit order.
ℹ Please use `arrange()` or `window_order()` to make deterministic.<SQL> SUM(`x`) OVER (ROWS UNBOUNDED PRECEDING)<SQL> LAG(`x`, 1, NULL) OVER ()By default dbplyr::translate_sql() will translate R / dplyr code into ANSI SQL, if we want to see results specific to a certain database we can pass in a connection object,
<SQL> STDEV(`x`) OVER ()<SQL> `x` || ' ' || `y`Warning: Windowed expression `SUM(`x`)` does not have explicit order.
ℹ Please use `arrange()` or `window_order()` to make deterministic.<SQL> SUM(`x`) OVER (ROWS UNBOUNDED PRECEDING)<SQL> LAG(`x`, 1, NULL) OVER ()Error in `collect()`:
! Failed to collect lazy table.
Caused by error:
! no such function: greplThere are two packages that implement this in R which take different approaches,
# A tibble: 181 × 3
   origin dest      n
   <chr>  <chr> <int>
 1 EWR    ATL      15
 2 EWR    AUS       3
 3 EWR    AVL       1
 4 EWR    BNA       7
 5 EWR    BOS      17
 6 EWR    BTV       3
 7 EWR    BUF       2
 8 EWR    BWI       1
 9 EWR    CHS       4
10 EWR    CLE       4
# ℹ 171 more rows# A tibble: 181 × 3
   origin dest      n
   <chr>  <chr> <int>
 1 JFK    LAX      32
 2 LGA    ORD      31
 3 LGA    ATL      30
 4 JFK    SFO      24
 5 LGA    CLT      22
 6 EWR    ORD      18
 7 EWR    SFO      18
 8 EWR    BOS      17
 9 LGA    MIA      17
10 EWR    LAX      16
# ℹ 171 more rowsWarning in fun(libname, pkgname): no display name and no $DISPLAY
environment variable    origin dest  n
1      EWR  ATL 15
2      EWR  AUS  3
3      EWR  AVL  1
4      EWR  BNA  7
5      EWR  BOS 17
6      EWR  BTV  3
7      EWR  BUF  2
8      EWR  BWI  1
9      EWR  CHS  4
10     EWR  CLE  4
11     EWR  CLT 15
12     EWR  CMH  3
13     EWR  CVG  9
14     EWR  DAY  4
15     EWR  DCA  3
16     EWR  DEN  8
17     EWR  DFW  9
18     EWR  DSM  2
19     EWR  DTW 10
20     EWR  FLL 10
21     EWR  GRR  2
22     EWR  GSO  4
23     EWR  GSP  2
24     EWR  HNL  1
25     EWR  HOU  3
26     EWR  IAD  5
27     EWR  IAH 11
28     EWR  IND  5
29     EWR  JAX  4
30     EWR  LAS  6
31     EWR  LAX 16
32     EWR  MCI  4
33     EWR  MCO 13
34     EWR  MDW  6
35     EWR  MEM  3
36     EWR  MHT  3
37     EWR  MIA  7
38     EWR  MKE  3
39     EWR  MSN  1
40     EWR  MSP 10
41     EWR  MSY  4
42     EWR  OKC  1
43     EWR  OMA  2
44     EWR  ORD 18
45     EWR  ORF  1
46     EWR  PBI  5
47     EWR  PDX  2
48     EWR  PHX  7
49     EWR  PIT  1
50     EWR  PVD  1
51     EWR  PWM  1
52     EWR  RDU  4
53     EWR  RIC  5
54     EWR  RSW  3
55     EWR  SAN  3
56     EWR  SAT  1
57     EWR  SAV  2
58     EWR  SDF  3
59     EWR  SEA  5
60     EWR  SFO 18
61     EWR  SJU  2
62     EWR  SLC  1
63     EWR  SNA  3
64     EWR  STL  8
65     EWR  TPA  6
66     EWR  TUL  1
67     EWR  TYS  1
68     EWR  XNA  1
69     JFK  ABQ  1
70     JFK  ATL  5
71     JFK  AUS  4
72     JFK  BNA  2
73     JFK  BOS 16
74     JFK  BQN  1
75     JFK  BTV  4
76     JFK  BUF 12
77     JFK  BUR  1
78     JFK  BWI  3
79     JFK  CHS  3
80     JFK  CLE  1
81     JFK  CLT  8
82     JFK  CMH  2
83     JFK  CVG  3
84     JFK  DCA  9
85     JFK  DEN  2
86     JFK  DFW  2
87     JFK  DTW  3
88     JFK  FLL  9
89     JFK  HNL  1
90     JFK  HOU  2
91     JFK  IAD  7
92     JFK  IAH  1
93     JFK  IND  2
94     JFK  JAX  3
95     JFK  LAS 12
96     JFK  LAX 32
97     JFK  LGB  2
98     JFK  MCI  1
99     JFK  MCO 14
100    JFK  MIA  9
101    JFK  MSP  3
102    JFK  MSY  5
103    JFK  OAK  1
104    JFK  ORD  7
105    JFK  ORF  2
106    JFK  PBI  4
107    JFK  PDX  2
108    JFK  PHL  2
109    JFK  PHX  6
110    JFK  PIT  3
111    JFK  PSE  1
112    JFK  PWM  4
113    JFK  RDU  9
114    JFK  ROC  4
115    JFK  RSW  2
116    JFK  SAN  5
117    JFK  SAT  1
118    JFK  SEA  7
119    JFK  SFO 24
120    JFK  SJC  1
121    JFK  SJU 11
122    JFK  SLC  6
123    JFK  SMF  1
124    JFK  SRQ  1
125    JFK  SYR  4
126    JFK  TPA  8
127    LGA  ATL 30
128    LGA  BGR  2
129    LGA  BHM  1
130    LGA  BNA 11
131    LGA  BOS 16
132    LGA  BTV  1
133    LGA  BUF  2
134    LGA  CAK  2
135    LGA  CHS  3
136    LGA  CLE  6
137    LGA  CLT 22
138    LGA  CMH  7
139    LGA  CVG  1
140    LGA  DAY  1
141    LGA  DCA 16
142    LGA  DEN 11
143    LGA  DFW 14
144    LGA  DSM  1
145    LGA  DTW 15
146    LGA  FLL 10
147    LGA  GSO  4
148    LGA  GSP  1
149    LGA  HOU  2
150    LGA  IAD  6
151    LGA  IAH  9
152    LGA  ILM  1
153    LGA  IND  1
154    LGA  JAX  2
155    LGA  MCI  1
156    LGA  MCO 10
157    LGA  MDW  6
158    LGA  MEM  3
159    LGA  MIA 17
160    LGA  MKE  5
161    LGA  MSN  2
162    LGA  MSP 11
163    LGA  MSY  3
164    LGA  OMA  1
165    LGA  ORD 31
166    LGA  ORF  1
167    LGA  PBI  6
168    LGA  PHL  2
169    LGA  PIT  5
170    LGA  RDU 11
171    LGA  RIC  3
172    LGA  ROC  2
173    LGA  RSW  2
174    LGA  SAV  1
175    LGA  SDF  1
176    LGA  SRQ  2
177    LGA  STL  6
178    LGA  SYR  1
179    LGA  TPA  6
180    LGA  TYS  1
181    LGA  XNA  3# A tibble: 181 × 3
   origin dest      n
   <chr>  <chr> <int>
 1 JFK    LAX      32
 2 LGA    ORD      31
 3 LGA    ATL      30
 4 JFK    SFO      24
 5 LGA    CLT      22
 6 EWR    ORD      18
 7 EWR    SFO      18
 8 EWR    BOS      17
 9 LGA    MIA      17
10 EWR    LAX      16
# ℹ 171 more rowsThe ability of dplyr to translate from R expression to SQL is an incredibly powerful tool making your data processing workflows portable across a wide variety of data backends.
Some tools and ecosystems that are worth learning about:
Sta 523 - Fall 2025