Lecture 08
Combination of two sets of slides from previous years:
Happy families are all alike; every unhappy family is unhappy in its own way
— Leo Tolstoy, Anna Karenina
# A tibble: 317 × 7
artist track date.entered wk1 wk2 wk3 wk4
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby Don't Cry… 2000-02-26 87 82 72 77
2 2Ge+her The Hardest Pa… 2000-09-02 91 87 92 NA
3 3 Doors Down Kryptonite 2000-04-08 81 70 68 67
4 3 Doors Down Loser 2000-10-21 76 76 72 69
5 504 Boyz Wobble Wobble 2000-04-15 57 34 25 17
6 98^0 Give Me Just O… 2000-08-19 51 39 34 26
7 A*Teens Dancing Queen 2000-07-08 97 97 96 95
8 Aaliyah I Don't Wanna 2000-01-29 84 62 51 41
9 Aaliyah Try Again 2000-03-18 59 53 38 28
10 Adams, Yolanda Open My Heart 2000-08-26 76 76 74 69
# ℹ 307 more rows
Is this data tidy?
The tidyverse includes the tibble package, which extends data frames to be a bit more modern. The core features of tibbles are a nicer printing method as well as being “surly” and “lazy”.
Sepal.Length Sepal.Width Petal.Length
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5.0 3.6 1.4
6 5.4 3.9 1.7
[... output truncated: a base data frame prints all 150 rows, and the remaining columns Petal.Width and Species wrap into a second block of output below the first ...]
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length
<dbl> <dbl> <dbl>
1 5.1 3.5 1.4
2 4.9 3 1.4
3 4.7 3.2 1.3
4 4.6 3.1 1.5
5 5 3.6 1.4
6 5.4 3.9 1.7
7 4.6 3.4 1.4
8 5 3.4 1.5
9 4.4 2.9 1.4
10 4.9 3.1 1.5
# ℹ 140 more rows
# ℹ 2 more variables: Petal.Width <dbl>,
# Species <fct>
By default, subsetting a tibble always results in another tibble ($ or [[ can still be used to extract a single column), i.e. tibble subsets are always preserving and therefore type consistent.
Tibbles do not use partial matching when the $ operator is used.
Only vectors with length 1 will undergo length coercion / recycling - anything else throws an error.
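A minimal sketch of these “surly” behaviors (column names here are made up for illustration):

```r
library(tibble)

df <- data.frame(value = 1:3)
df$val   # base data frames partially match names with $: returns 1 2 3

tb <- tibble(value = 1:3)
tb$val   # tibbles refuse to partially match: returns NULL, with a warning

tibble(x = 1:4, y = 1)     # ok: the length-1 y is recycled to length 4
# tibble(x = 1:4, y = 1:2) # error: only length-1 values are recycled
```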
[1] [ [[ [[<- [<- $
[6] $<- as.data.frame coerce initialize names<-
[11] Ops row.names<- show slotsFromS3 str
[16] tbl_sum
see '?methods' for accessing help and source code
[1] [[<- [<- $<- coerce format
[6] glimpse initialize Ops print show
[11] slotsFromS3 tbl_sum
see '?methods' for accessing help and source code
Tibbles are just specialized data frames, and will fall back to base data frame methods when needed.
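This fallback is a consequence of the class vector, which a quick check makes visible:

```r
library(tibble)

tb <- tibble(x = 1:3)
class(tb)
# "tbl_df" "tbl" "data.frame" -- S3 dispatch walks this vector in order,
# so any method defined for data.frame (e.g. nrow, merge) still applies
nrow(tb)  # 3
```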
Why did this work?
magrittr
In software engineering, a pipeline consists of a chain of processing elements (processes, threads, coroutines, functions, etc.), arranged so that the output of each element is the input of the next;
Magrittr’s pipe (%>%) is an infix operator that lets us chain functions together in a way that reads from left to right.
The three code examples below are equivalent:
The current version of R on the departmental servers is v4.4.1, so you are welcome to use the native pipe (|>) as well.
All of the following are fine; it comes down to personal preference. Try to be consistent.
Nested:
Piped:
Intermediate:
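The three styles can be sketched as follows, using hypothetical functions f(), g(), h() and an input x (not from any particular package):

```r
# Nested: read inside out
h(g(f(x)))

# Piped: read left to right
x |> f() |> g() |> h()      # native pipe (R >= 4.1)
x %>% f() %>% g() %>% h()   # magrittr pipe

# Intermediate: name each step
y1 <- f(x)
y2 <- g(y1)
h(y2)
```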
Sometimes we want to send our result to a function argument other than the first one, or we want to use the previous result for multiple arguments. In these cases we can refer to the previous result using the placeholder . (the native pipe |> uses _ instead).
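Two common placeholder patterns with magrittr (built-in mtcars data used for illustration):

```r
library(magrittr)

# Result goes to an argument other than the first:
mtcars %>% lm(mpg ~ wt, data = .)

# Result used more than once (braces suppress first-argument insertion):
c(1, 4, 9) %>% { . / sum(.) }
```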
dplyr is based on the concepts of functions as verbs that manipulate data frames.
Core single data frame functions / verbs:
filter() / slice() - pick rows based on criteria
mutate() / transmute() - create or modify columns
summarise() / count() - reduce variables to values
group_by() / ungroup() - modify other verbs to act on subsets
select() / rename() - select columns by name
pull() - grab a column as a vector
arrange() - reorder rows
distinct() - filter for unique rows
relocate() - change column order
First argument is always a data frame
Subsequent arguments say what to do with the data frame
Always return a data frame
Don’t modify in place (how does this affect memory usage?)
Magic via non-standard evaluation + lazy evaluation and S3
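A small sketch of these properties on a toy tibble (names g, x are made up):

```r
library(dplyr)

d <- tibble::tibble(
  g = c("a", "a", "b"),
  x = c(1, 2, 3)
)

# Each verb takes a data frame first and returns a new data frame;
# d itself is never modified in place:
d |>
  filter(x > 1) |>
  mutate(x2 = x^2) |>
  summarise(total = sum(x2), .by = g)
# two rows: g = "a" (total 4) and g = "b" (total 9)
```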
We will demonstrate dplyr’s functionality using the nycflights13 data.
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
# A tibble: 28,834 × 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 3 1 4 2159 125 318
2 2013 3 1 50 2358 52 526
3 2013 3 1 117 2245 152 223
4 2013 3 1 454 500 -6 633
5 2013 3 1 505 515 -10 746
6 2013 3 1 521 530 -9 813
7 2013 3 1 537 540 -3 856
8 2013 3 1 541 545 -4 1014
9 2013 3 1 549 600 -11 639
10 2013 3 1 550 600 -10 747
# ℹ 28,824 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
# A tibble: 6,530 × 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 3 1 4 2159 125 318
2 2013 3 1 50 2358 52 526
3 2013 3 1 117 2245 152 223
4 2013 3 1 454 500 -6 633
5 2013 3 1 505 515 -10 746
6 2013 3 1 521 530 -9 813
7 2013 3 1 537 540 -3 856
8 2013 3 1 541 545 -4 1014
9 2013 3 1 549 600 -11 639
10 2013 3 1 550 600 -10 747
# ℹ 6,520 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
# A tibble: 1,178 × 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 3 1 607 610 -3 832
2 2013 3 1 629 632 -3 844
3 2013 3 1 657 700 -3 953
4 2013 3 1 714 715 -1 939
5 2013 3 1 716 710 6 958
6 2013 3 1 727 730 -3 1007
7 2013 3 1 836 840 -4 1111
8 2013 3 1 857 900 -3 1202
9 2013 3 1 903 900 3 1157
10 2013 3 1 904 831 33 1150
# ℹ 1,168 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
# A tibble: 10 × 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
# A tibble: 3 × 19
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 9 30 NA 1210 NA NA
2 2013 9 30 NA 1159 NA NA
3 2013 9 30 NA 840 NA NA
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
Could also do
# A tibble: 336,776 × 16
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
<int> <int> <dbl> <int> <int> <dbl>
1 517 515 2 830 819 11
2 533 529 4 850 830 20
3 542 540 2 923 850 33
4 544 545 -1 1004 1022 -18
5 554 600 -6 812 837 -25
6 554 558 -4 740 728 12
7 555 600 -5 913 854 19
8 557 600 -3 709 723 -14
9 557 600 -3 838 846 -8
10 558 600 -2 753 745 8
# ℹ 336,766 more rows
# ℹ 10 more variables: carrier <chr>, flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>
tidyselect
# A tibble: 336,776 × 5
dep_time sched_dep_time dep_delay arr_time arr_delay
<int> <int> <dbl> <int> <dbl>
1 517 515 2 830 11
2 533 529 4 850 20
3 542 540 2 923 33
4 544 545 -1 1004 -18
5 554 600 -6 812 -25
6 554 558 -4 740 12
7 555 600 -5 913 19
8 557 600 -3 709 -14
9 557 600 -3 838 -8
10 558 600 -2 753 8
# ℹ 336,766 more rows
# A tibble: 336,776 × 14
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ℹ 336,766 more rows
# ℹ 7 more variables: sched_arr_time <int>, arr_delay <dbl>, flight <int>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>
# A tibble: 336,776 × 19
tail_number year month day dep_time sched_dep_time dep_delay arr_time
<chr> <int> <int> <int> <int> <int> <dbl> <int>
1 N14228 2013 1 1 517 515 2 830
2 N24211 2013 1 1 533 529 4 850
3 N619AA 2013 1 1 542 540 2 923
4 N804JB 2013 1 1 544 545 -1 1004
5 N668DN 2013 1 1 554 600 -6 812
6 N39463 2013 1 1 554 558 -4 740
7 N516JB 2013 1 1 555 600 -5 913
8 N829AS 2013 1 1 557 600 -3 709
9 N593JB 2013 1 1 557 600 -3 838
10 N3ALAA 2013 1 1 558 600 -2 753
# ℹ 336,766 more rows
# ℹ 11 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, origin <chr>, dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 765 × 20
original_row_number origin dest year month day dep_time
<int> <chr> <chr> <int> <int> <int> <int>
1 137570 EWR ALB 2013 3 2 1336
2 137257 EWR ATL 2013 3 2 628
3 137262 EWR ATL 2013 3 2 637
4 137303 EWR ATL 2013 3 2 743
5 137391 EWR ATL 2013 3 2 857
6 137455 EWR ATL 2013 3 2 1027
7 137491 EWR ATL 2013 3 2 1134
8 137592 EWR ATL 2013 3 2 1412
9 137726 EWR ATL 2013 3 2 1633
10 137750 EWR ATL 2013 3 2 1655
# ℹ 755 more rows
# ℹ 13 more variables: sched_dep_time <int>, dep_delay <dbl>,
# arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
# flight <int>, tailnum <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 336,776 × 5
year month day date days_until_2025
<int> <int> <int> <chr> <dbl>
1 2013 1 1 2013-1-1 4383
2 2013 1 1 2013-1-1 4383
3 2013 1 1 2013-1-1 4383
4 2013 1 1 2013-1-1 4383
5 2013 1 1 2013-1-1 4383
6 2013 1 1 2013-1-1 4383
7 2013 1 1 2013-1-1 4383
8 2013 1 1 2013-1-1 4383
9 2013 1 1 2013-1-1 4383
10 2013 1 1 2013-1-1 4383
# ℹ 336,766 more rows
# A tibble: 336,776 × 19
# Groups: origin [3]
year month day dep_time sched_dep_time dep_delay arr_time
<int> <int> <int> <int> <int> <dbl> <int>
1 2013 1 1 517 515 2 830
2 2013 1 1 533 529 4 850
3 2013 1 1 542 540 2 923
4 2013 1 1 544 545 -1 1004
5 2013 1 1 554 600 -6 812
6 2013 1 1 554 558 -4 740
7 2013 1 1 555 600 -5 913
8 2013 1 1 557 600 -3 709
9 2013 1 1 557 600 -3 838
10 2013 1 1 558 600 -2 753
# ℹ 336,766 more rows
# ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
`summarise()` has grouped output by 'origin'. You can override using the
`.groups` argument.
# A tibble: 36 × 5
# Groups: origin [3]
origin month n min_dep_delay max_dep_delay
<chr> <int> <int> <dbl> <dbl>
1 EWR 1 9893 -21 1126
2 EWR 2 9107 -21 786
3 EWR 3 10420 -22 443
4 EWR 4 10531 -21 545
5 EWR 5 10592 -20 878
6 EWR 6 10175 -19 502
7 EWR 7 10475 -18 653
8 EWR 8 10359 -17 424
9 EWR 9 9550 -23 486
10 EWR 10 10104 -25 702
# ℹ 26 more rows
# A tibble: 36 × 5
origin month n min_dep_delay
<chr> <int> <int> <dbl>
1 EWR 1 9893 -21
2 EWR 2 9107 -21
3 EWR 3 10420 -22
4 EWR 4 10531 -21
5 EWR 5 10592 -20
6 EWR 6 10175 -19
7 EWR 7 10475 -18
8 EWR 8 10359 -17
9 EWR 9 9550 -23
10 EWR 10 10104 -25
# ℹ 26 more rows
# ℹ 1 more variable: max_dep_delay <dbl>
# A tibble: 36 × 5
# Groups: origin, month [36]
origin month n min_dep_delay
<chr> <int> <int> <dbl>
1 EWR 1 9893 -21
2 EWR 2 9107 -21
3 EWR 3 10420 -22
4 EWR 4 10531 -21
5 EWR 5 10592 -20
6 EWR 6 10175 -19
7 EWR 7 10475 -18
8 EWR 8 10359 -17
9 EWR 9 9550 -23
10 EWR 10 10104 -25
# ℹ 26 more rows
# ℹ 1 more variable: max_dep_delay <dbl>
The .by (and by) arguments are used for per-operation grouping, while group_by() is intended for persistent grouping. See ?dplyr_by for more details and examples.
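The difference in a side-by-side sketch (using the nycflights13 data from above):

```r
library(dplyr)
library(nycflights13)

# Persistent grouping: the summarise() result is still grouped by origin
flights |>
  group_by(origin, month) |>
  summarise(n = n())

# Per-operation grouping: the result is an ordinary ungrouped tibble
flights |>
  summarise(n = n(), .by = c(origin, month))
```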
# A tibble: 336,776 × 5
origin arr_delay arrival_delay_normalized_by_origin sd_ad mean_ad
<chr> <dbl> <dbl> <dbl> <dbl>
1 EWR 11 0.0416 45.5 9.11
2 LGA 20 0.324 43.9 5.78
3 JFK 33 0.620 44.3 5.55
4 JFK -18 -0.532 44.3 5.55
5 LGA -25 -0.702 43.9 5.78
6 EWR 12 0.0635 45.5 9.11
7 EWR 19 0.217 45.5 9.11
8 LGA -14 -0.451 43.9 5.78
9 JFK -8 -0.306 44.3 5.55
10 LGA 8 0.0505 43.9 5.78
# ℹ 336,766 more rows
How many flights to Los Angeles (LAX) did each of the legacy carriers (AA, UA, DL or US) have in May from JFK, and what was their average duration?
What was the shortest flight out of each airport in terms of distance? In terms of duration?
Which plane (check the tail number) flew out of each New York airport the most?
Which date should you fly on if you want to have the lowest possible average departure delay? What about arrival delay?
pivot_longer (previously gather or reshape2::melt)
pivot_wider (previously spread)
( d = tibble::tribble(
~country, ~year, ~type, ~count,
"A", 1999, "cases", "0.7K",
"A", 1999, "pop", "19M",
"A", 2000, "cases", "2K",
"A", 2000, "pop", "20M",
"B", 1999, "cases", "37K",
"B", 1999, "pop", "172M",
"B", 2000, "cases", " 80K",
"B", 2000, "pop", "174M",
"C", 1999, "cases", "212K",
"C", 1999, "pop", "1T",
"C", 2000, "cases", "213K",
"C", 2000, "pop", "1T"
)
)
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <chr>
1 A 1999 cases "0.7K"
2 A 1999 pop "19M"
3 A 2000 cases "2K"
4 A 2000 pop "20M"
5 B 1999 cases "37K"
6 B 1999 pop "172M"
7 B 2000 cases " 80K"
8 B 2000 pop "174M"
9 C 1999 cases "212K"
10 C 1999 pop "1T"
11 C 2000 cases "213K"
12 C 2000 pop "1T"
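The long table d above can be widened with pivot_wider() so cases and pop become their own columns, and pivot_longer() reverses the operation (a sketch; note count is stored as character, so the widened values stay character):

```r
library(tidyr)

d |>
  pivot_wider(names_from = type, values_from = count)

# and back to long form:
d |>
  pivot_wider(names_from = type, values_from = count) |>
  pivot_longer(cases:pop, names_to = "type", values_to = "count")
```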
The palmerpenguins package contains measurement data on various penguin species from islands near Palmer Station in Antarctica. The code below shows the number of each species measured on each of the three islands (a missing island/species pair implies that species does not occur on that island).
# A tibble: 5 × 3
island species n
<fct> <fct> <int>
1 Biscoe Adelie 44
2 Biscoe Gentoo 124
3 Dream Adelie 56
4 Dream Chinstrap 68
5 Torgersen Adelie 52
Starting from these data construct a contingency table of counts for island (rows) by species (columns) using the pivot functions we’ve just discussed.
In previous versions of tidyr there was a single catch-all separate() function. This still exists and is available in the package but it is superseded.
Other helpful separate functions:
separate_longer_position()
separate_wider_position()
separate_wider_regex()
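A sketch of the delimiter-based variants on a toy tibble (column names id, x are made up):

```r
library(tidyr)

df <- tibble::tibble(id = 1:2, x = c("a-1", "b-2"))

# Split x into two new columns at the delimiter:
df |> separate_wider_delim(x, delim = "-", names = c("letter", "number"))

# Split x into extra rows instead:
df |> separate_longer_delim(x, delim = "-")
```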
Is the following data tidy?
How would we calculate a final score based on the following formula, \[\text{score} = 0.5\,\frac{\sum_i\text{hw}_i}{80} + 0.5\,\frac{\sum_j\text{proj}_j}{200}\]
# A tibble: 4 × 10
name hw_1 hw_2 hw_3 hw_4 proj_1 proj_2 hw_avg proj_avg overall
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alice 19 19 18 20 89 95 19 92 0.935
2 Bob 18 20 18 16 77 88 18 82.5 0.862
3 Carol 18 20 18 17 96 99 18.2 97.5 0.944
4 Dave 19 19 18 19 86 82 18.8 84 0.889
# A tibble: 24 × 3
name assignment score
<chr> <chr> <dbl>
1 Alice hw_1 19
2 Alice hw_2 19
3 Alice hw_3 18
4 Alice hw_4 20
5 Alice proj_1 89
6 Alice proj_2 95
7 Bob hw_1 18
8 Bob hw_2 20
9 Bob hw_3 18
10 Bob hw_4 16
# ℹ 14 more rows
# A tibble: 24 × 4
name type id score
<chr> <chr> <chr> <dbl>
1 Alice hw 1 19
2 Alice hw 2 19
3 Alice hw 3 18
4 Alice hw 4 20
5 Alice proj 1 89
6 Alice proj 2 95
7 Bob hw 1 18
8 Bob hw 2 20
9 Bob hw 3 18
10 Bob hw 4 16
# ℹ 14 more rows
# A tibble: 8 × 3
name type total
<chr> <chr> <dbl>
1 Alice hw 76
2 Alice proj 184
3 Bob hw 72
4 Bob proj 165
5 Carol hw 73
6 Carol proj 195
7 Dave hw 75
8 Dave proj 168
# A tibble: 4 × 3
name hw proj
<chr> <dbl> <dbl>
1 Alice 76 184
2 Bob 72 165
3 Carol 73 195
4 Dave 75 168
final_grades <- grades |>
tidyr::pivot_longer(
cols = hw_1:proj_2,
names_to = c("type", "id"),
names_sep = "_",
values_to = "score"
) |>
summarize(
total = sum(score),
.by = c(name, type)
) |>
tidyr::pivot_wider(
names_from = type,
values_from = total
) |>
mutate(
score = 0.5*(hw/80) +
0.5*(proj/200)
)
final_grades
# A tibble: 4 × 4
name hw proj score
<chr> <dbl> <dbl> <dbl>
1 Alice 76 184 0.935
2 Bob 72 165 0.862
3 Carol 73 195 0.944
4 Dave 75 168 0.889
by
By default dplyr’s join functions will join based on intersecting column names between the two data frames.
To specify the columns to join by (or to handle non-matching names) pass in a character vector of column names (or a named character vector where the names match the left data frame and the values match the right).
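A sketch of the named-vector form with made-up tables x and y whose key columns have different names:

```r
library(dplyr)

x <- tibble::tibble(name = c("Alice", "Bob"), score = c(0.9, 0.8))
y <- tibble::tibble(person = c("Alice", "Bob"), n = c(1L, 2L))

# Name on the left matches x's column, value matches y's column:
left_join(x, y, by = c("name" = "person"))
```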
Recently more advanced joins have been implemented using the join_by() construct which allows for: equality, inequality, rolling, overlap, and cross joins. See ?join_by for details.
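A sketch of an inequality and a rolling join via join_by(), using made-up sales/promos tables:

```r
library(dplyr)

sales  <- tibble::tibble(id = 1, sale_date = as.Date("2024-03-15"))
promos <- tibble::tibble(
  id = 1,
  promo_date = as.Date(c("2024-01-01", "2024-03-01", "2024-04-01"))
)

# Inequality join: every promo on or before the sale date (two matches)
left_join(sales, promos, by = join_by(id, sale_date >= promo_date))

# Rolling join: only the closest preceding promo (2024-03-01)
left_join(sales, promos, by = join_by(id, closest(sale_date >= promo_date)))
```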
behavior_reports <- tibble::tribble(
~incident_name, ~person_1, ~person_2, ~person_3,
"Food fight", "alice,bob", "carole,dave", NA,
"'Kick me'", "alice", "carole", NA,
"Flash mob", "bob", "carol", "Dave",
"Playing games", "Alice", "Bob", NA
) |>
pivot_longer(person_1:person_3,
values_to = "person_involved",
values_drop_na = TRUE) |>
separate_longer_delim(person_involved, delim = ",") |>
summarise(num_incidents = n(), .by = person_involved)
behavior_reports
# A tibble: 8 × 2
person_involved num_incidents
<chr> <int>
1 alice 2
2 bob 2
3 carole 2
4 dave 1
5 carol 1
6 Dave 1
7 Alice 1
8 Bob 1
# A tibble: 4 × 4
name hw proj score
<chr> <dbl> <dbl> <dbl>
1 Alice 76 184 0.935
2 Bob 72 165 0.862
3 Carol 73 195 0.944
4 Dave 75 168 0.889
# A tibble: 8 × 2
person_involved num_incidents
<chr> <int>
1 alice 2
2 bob 2
3 carole 2
4 dave 1
5 carol 1
6 Dave 1
7 Alice 1
8 Bob 1
# A tibble: 4 × 5
name hw proj score num_incidents
<chr> <dbl> <dbl> <dbl> <int>
1 Alice 76 184 0.935 1
2 Bob 72 165 0.862 1
3 Carol 73 195 0.944 NA
4 Dave 75 168 0.889 1
# A tibble: 8 × 5
name hw proj score num_incidents
<chr> <dbl> <dbl> <dbl> <int>
1 Alice 76 184 0.935 1
2 Bob 72 165 0.862 1
3 Dave 75 168 0.889 1
4 alice NA NA NA 2
5 bob NA NA NA 2
6 carole NA NA NA 2
7 dave NA NA NA 1
8 carol NA NA NA 1
# A tibble: 4 × 4
name hw proj score
<chr> <dbl> <dbl> <dbl>
1 Alice 76 184 0.935
2 Bob 72 165 0.862
3 Carol 73 195 0.944
4 Dave 75 168 0.889
# A tibble: 8 × 2
person_involved num_incidents
<chr> <int>
1 alice 2
2 bob 2
3 carole 2
4 dave 1
5 carol 1
6 Dave 1
7 Alice 1
8 Bob 1
# A tibble: 3 × 5
name hw proj score num_incidents
<chr> <dbl> <dbl> <dbl> <int>
1 Alice 76 184 0.935 1
2 Bob 72 165 0.862 1
3 Dave 75 168 0.889 1
# A tibble: 9 × 5
name hw proj score num_incidents
<chr> <dbl> <dbl> <dbl> <int>
1 Alice 76 184 0.935 1
2 Bob 72 165 0.862 1
3 Carol 73 195 0.944 NA
4 Dave 75 168 0.889 1
5 alice NA NA NA 2
6 bob NA NA NA 2
7 carole NA NA NA 2
8 dave NA NA NA 1
9 carol NA NA NA 1
# A tibble: 8 × 6
name hw proj score person_involved num_incidents
<chr> <dbl> <dbl> <dbl> <chr> <int>
1 Alice 76 184 0.935 alice 2
2 Alice 76 184 0.935 Alice 1
3 Bob 72 165 0.862 bob 2
4 Bob 72 165 0.862 Bob 1
5 Carol 73 195 0.944 carole 2
6 Carol 73 195 0.944 carol 1
7 Dave 75 168 0.889 dave 1
8 Dave 75 168 0.889 Dave 1
This is like using a chainsaw to break a twig. How could we do this in a more principled manner?
Sta 523 - Fall 2025