Tidymodels

Lecture 23

Dr. Colin Rundel

Tidymodels

library(tidymodels)
── Attaching packages ─────────────────────────────────────────── tidymodels 1.4.1 ──
✔ broom        1.0.10     ✔ rsample      1.3.1 
✔ dials        1.4.2      ✔ tailor       0.1.0 
✔ infer        1.0.9      ✔ tune         2.0.1 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.3.3      ✔ workflowsets 1.1.1 
✔ recipes      1.3.1      ✔ yardstick    1.3.2 
── Conflicts ────────────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard()   masks purrr::discard()
✖ dplyr::filter()     masks stats::filter()
✖ recipes::fixed()    masks stringr::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ rsample::populate() masks Rcpp::populate()
✖ yardstick::spec()   masks readr::spec()
✖ recipes::step()     masks stats::step()

Book data

(books = DAAG::allbacks |>
  as_tibble() |>
  select(-area) |>
  mutate(
    cover = forcats::fct_recode(
      cover, 
      "hardback" = "hb", 
      "paperback" = "pb"
    )
  )
)
# A tibble: 15 × 3
   volume weight cover    
    <dbl>  <dbl> <fct>    
 1    885    800 hardback 
 2   1016    950 hardback 
 3   1125   1050 hardback 
 4    239    350 hardback 
 5    701    750 hardback 
 6    641    600 hardback 
 7   1228   1075 hardback 
 8    412    250 paperback
 9    953    700 paperback
10    929    650 paperback
11   1492    975 paperback
12    419    350 paperback
13   1010    950 paperback
14    595    425 paperback
15   1034    725 paperback

ggplot(books, aes(x=volume, y=weight, color = cover)) +
  geom_point(size=2)

Building a tidymodel

linear_reg()
Linear Regression Model Specification (regression)

Computational engine: lm 
linear_reg() |>
  set_engine("lm")
Linear Regression Model Specification (regression)

Computational engine: lm 

Building a tidymodel

linear_reg() |>
  set_engine("lm") |>
  fit(weight ~ volume * cover, data=books)
parsnip model object


Call:
stats::lm(formula = weight ~ volume * cover, data = data)

Coefficients:
          (Intercept)                 volume  
            161.58654                0.76159  
       coverpaperback  volume:coverpaperback  
           -120.21407               -0.07573  
lm(weight ~ volume * cover, data=books)

Call:
lm(formula = weight ~ volume * cover, data = books)

Coefficients:
          (Intercept)                 volume  
            161.58654                0.76159  
       coverpaperback  volume:coverpaperback  
           -120.21407               -0.07573  

Tidy model objects

lm_tm = linear_reg() |>
  set_engine("lm") |>
  fit(weight ~ volume * cover, 
      data = books)
lm_b = lm(weight ~ volume * cover, data = books)
summary(lm_tm)
             Length Class      Mode   
lvl           0     -none-     NULL   
ordered       1     -none-     logical
spec          7     linear_reg list   
fit          13     lm         list   
preproc       1     -none-     list   
elapsed       2     -none-     list   
censor_probs  0     -none-     list   
summary(lm_b)

Call:
lm(formula = weight ~ volume * cover, data = books)

Residuals:
   Min     1Q Median     3Q    Max 
-89.67 -32.07 -21.82  17.94 215.91 

Coefficients:
                        Estimate Std. Error t value
(Intercept)            161.58654   86.51918   1.868
volume                   0.76159    0.09718   7.837
coverpaperback        -120.21407  115.65899  -1.039
volume:coverpaperback   -0.07573    0.12802  -0.592
                      Pr(>|t|)    
(Intercept)             0.0887 .  
volume                7.94e-06 ***
coverpaperback          0.3209    
volume:coverpaperback   0.5661    
---
Signif. codes:  
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 80.41 on 11 degrees of freedom
Multiple R-squared:  0.9297,    Adjusted R-squared:  0.9105 
F-statistic:  48.5 on 3 and 11 DF,  p-value: 1.245e-06

summary(lm_tm$fit)

Call:
stats::lm(formula = weight ~ volume * cover, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-89.67 -32.07 -21.82  17.94 215.91 

Coefficients:
                        Estimate Std. Error t value
(Intercept)            161.58654   86.51918   1.868
volume                   0.76159    0.09718   7.837
coverpaperback        -120.21407  115.65899  -1.039
volume:coverpaperback   -0.07573    0.12802  -0.592
                      Pr(>|t|)    
(Intercept)             0.0887 .  
volume                7.94e-06 ***
coverpaperback          0.3209    
volume:coverpaperback   0.5661    
---
Signif. codes:  
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 80.41 on 11 degrees of freedom
Multiple R-squared:  0.9297,    Adjusted R-squared:  0.9105 
F-statistic:  48.5 on 3 and 11 DF,  p-value: 1.245e-06

Tidy coefficients

broom::tidy(lm_tm)
# A tibble: 4 × 5
  term                   estimate std.error statistic    p.value
  <chr>                     <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)            162.       86.5        1.87  0.0887    
2 volume                   0.762     0.0972     7.84  0.00000794
3 coverpaperback        -120.      116.        -1.04  0.321     
4 volume:coverpaperback   -0.0757    0.128     -0.592 0.566     
broom::tidy(lm_b)
# A tibble: 4 × 5
  term                   estimate std.error statistic    p.value
  <chr>                     <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)            162.       86.5        1.87  0.0887    
2 volume                   0.762     0.0972     7.84  0.00000794
3 coverpaperback        -120.      116.        -1.04  0.321     
4 volume:coverpaperback   -0.0757    0.128     -0.592 0.566     
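broom's lm tidier also accepts extra arguments; for example, conf.int = TRUE appends confidence intervals to the coefficient table. A minimal sketch (using an mtcars fit for self-containment; lm_b from above works the same way):

```r
library(broom)

# Any lm fit; conf.int / conf.level are documented arguments of tidy.lm()
fit <- lm(mpg ~ wt, data = mtcars)
tidy(fit, conf.int = TRUE, conf.level = 0.95)
```

The result gains conf.low and conf.high columns alongside the usual estimate, std.error, statistic, and p.value.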

Tidy statistics

broom::glance(lm_tm)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC deviance
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>
1     0.930         0.911  80.4      48.5   1.24e-6     3  -84.8  180.  183.   71118.
# ℹ 2 more variables: df.residual <int>, nobs <int>
broom::glance(lm_b)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC deviance
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>
1     0.930         0.911  80.4      48.5   1.24e-6     3  -84.8  180.  183.   71118.
# ℹ 2 more variables: df.residual <int>, nobs <int>

Tidy prediction

broom::augment(lm_tm, new_data = books)
# A tibble: 15 × 5
   .pred .resid volume weight cover    
   <dbl>  <dbl>  <dbl>  <dbl> <fct>    
 1  836. -35.6     885    800 hardback 
 2  935.  14.6    1016    950 hardback 
 3 1018.  31.6    1125   1050 hardback 
 4  344.   6.39    239    350 hardback 
 5  695.  54.5     701    750 hardback 
 6  650. -49.8     641    600 hardback 
 7 1097. -21.8    1228   1075 hardback 
 8  324. -73.9     412    250 paperback
 9  695.   5.00    953    700 paperback
10  679. -28.5     929    650 paperback
11 1065. -89.7    1492    975 paperback
12  329.  21.3     419    350 paperback
13  734. 216.     1010    950 paperback
14  449. -24.5     595    425 paperback
15  751. -25.6    1034    725 paperback

Putting it together

lm_tm |>
  augment(
    new_data = tidyr::expand_grid(
      volume = seq(0, 1500, by=5),
      cover = c("hardback", "paperback") |> as.factor()
    )
  ) |>
  rename(weight = .pred) |>
  ggplot(aes(x = volume, y = weight, color = cover, group = cover)) +
    geom_line() +
    geom_point(data = books)

Models?

Parsnip models

Other linear regression engines

show_engines("linear_reg")
# A tibble: 8 × 2
  engine   mode               
  <chr>    <chr>              
1 lm       regression         
2 glm      regression         
3 glmnet   regression         
4 stan     regression         
5 spark    regression         
6 keras    regression         
7 brulee   regression         
8 quantreg quantile regression
(bayes_tm = linear_reg() |> 
  set_engine(
    "stan", 
    prior_intercept = rstanarm::student_t(df = 1), 
    prior = rstanarm::student_t(df = 1),
    seed = 1234
  ) 
)
Linear Regression Model Specification (regression)

Engine-Specific Arguments:
  prior_intercept = rstanarm::student_t(df = 1)
  prior = rstanarm::student_t(df = 1)
  seed = 1234

Computational engine: stan 

Fitting with rstanarm

(bayes_tm = bayes_tm |>
  fit(weight ~ volume * cover, data = books)
)
Warning: There were 36 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems
parsnip model object

stan_glm
 family:       gaussian [identity]
 formula:      weight ~ volume * cover
 observations: 15
 predictors:   4
------
                      Median MAD_SD
(Intercept)           94.9   60.4  
volume                 0.8    0.1  
coverpaperback        -0.4    3.6  
volume:coverpaperback -0.2    0.0  

Auxiliary parameter(s):
      Median MAD_SD
sigma 84.9   17.6  

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg

What was actually run?

linear_reg() |> 
  set_engine(
    "stan", 
    prior_intercept = rstanarm::student_t(df = 1), 
    prior = rstanarm::student_t(df = 1),
    seed = 1234
  ) |>
  translate()
Linear Regression Model Specification (regression)

Engine-Specific Arguments:
  prior_intercept = rstanarm::student_t(df = 1)
  prior = rstanarm::student_t(df = 1)
  seed = 1234

Computational engine: stan 

Model fit template:
rstanarm::stan_glm(formula = missing_arg(), data = missing_arg(), 
    weights = missing_arg(), prior_intercept = rstanarm::student_t(df = 1), 
    prior = rstanarm::student_t(df = 1), seed = 1234, family = stats::gaussian, 
    refresh = 0)

Back to broom

broom::tidy(bayes_tm)
Error in `generics::tidy()`:
! `x` seems to be outputted from the rstanarm package.
ℹ Tidiers for mixed model output now live in broom.mixed.
broom.mixed::tidy(bayes_tm)
# A tibble: 4 × 3
  term                  estimate std.error
  <chr>                    <dbl>     <dbl>
1 (Intercept)             94.9     60.4   
2 volume                   0.834    0.0734
3 coverpaperback          -0.354    3.58  
4 volume:coverpaperback   -0.199    0.0498
broom.mixed::glance(bayes_tm)
# A tibble: 1 × 4
  algorithm   pss  nobs sigma
  <chr>     <dbl> <int> <dbl>
1 sampling   4000    15  84.9

Augment

augment(bayes_tm, new_data=books)
# A tibble: 15 × 5
   .pred  .resid volume weight cover    
   <dbl>   <dbl>  <dbl>  <dbl> <fct>    
 1  831.  -30.7     885    800 hardback 
 2  940.   10.3    1016    950 hardback 
 3 1030.   19.7    1125   1050 hardback 
 4  293.   56.5     239    350 hardback 
 5  678.   72.3     701    750 hardback 
 6  628.  -27.8     641    600 hardback 
 7 1116.  -41.0    1228   1075 hardback 
 8  354. -104.      412    250 paperback
 9  696.    4.18    953    700 paperback
10  681.  -30.6     929    650 paperback
11 1037.  -61.8    1492    975 paperback
12  358.   -8.00    419    350 paperback
13  732.  218.     1010    950 paperback
14  469.  -44.3     595    425 paperback
15  747.  -22.1    1034    725 paperback

Predictions

bayes_tm |>
  augment(
    new_data = tidyr::expand_grid(
      volume = seq(0, 1500, by=5),
      cover = c("hardback", "paperback") |> as.factor()
    )
  ) |>
  rename(weight = .pred) |>
  ggplot(aes(x = volume, y = weight, color = cover, group = cover)) +
    geom_line() +
    geom_point(data = books)

Performance

lm_tm |>
  augment(new_data = books) |>
  yardstick::rmse(weight, .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        68.9
bayes_tm |>
  augment(new_data = books) |>
  yardstick::rmse(weight, .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard        72.1
lm_tm |>
  augment(new_data = books) |>
  yardstick::mae(weight, .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard        46.6
bayes_tm |>
  augment(new_data = books) |>
  yardstick::mae(weight, .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 mae     standard        50.1
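Rather than calling rmse() and mae() separately, yardstick's metric_set() bundles several metrics into a single function. A sketch, with a toy data frame standing in for the augmented books predictions:

```r
library(yardstick)

# Bundle the two metrics used above into one callable
book_metrics <- metric_set(rmse, mae)

# Works on any data frame with truth and estimate columns, e.g.
#   lm_tm |> augment(new_data = books) |> book_metrics(weight, .pred)
d <- data.frame(truth = c(1, 2, 3), est = c(1, 2, 4))
book_metrics(d, truth = truth, estimate = est)
```

The result stacks one row per metric, in the same .metric / .estimator / .estimate layout as the individual calls.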

Cross validation and feature engineering

The Office & IMDB

(office_ratings = read_csv("data/office_ratings.csv"))
# A tibble: 188 × 6
   season episode title             imdb_rating total_votes air_date  
    <dbl>   <dbl> <chr>                   <dbl>       <dbl> <date>    
 1      1       1 Pilot                     7.6        3706 2005-03-24
 2      1       2 Diversity Day             8.3        3566 2005-03-29
 3      1       3 Health Care               7.9        2983 2005-04-05
 4      1       4 The Alliance              8.1        2886 2005-04-12
 5      1       5 Basketball                8.4        3179 2005-04-19
 6      1       6 Hot Girl                  7.8        2852 2005-04-26
 7      2       1 The Dundies               8.7        3213 2005-09-20
 8      2       2 Sexual Harassment         8.2        2736 2005-09-27
 9      2       3 Office Olympics           8.4        2742 2005-10-04
10      2       4 The Fire                  8.4        2713 2005-10-11
# ℹ 178 more rows

Rating vs Air Date

Test-train split

set.seed(123)
(office_split = initial_split(office_ratings, prop = 0.8))
<Training/Testing/Total>
<150/38/188>
(office_train = training(office_split))
# A tibble: 150 × 6
   season episode title    imdb_rating total_votes
    <dbl>   <dbl> <chr>          <dbl>       <dbl>
 1      8      18 Last Da…         7.8        1429
 2      9      14 Vandali…         7.6        1402
 3      2       8 Perform…         8.2        2416
 4      9       5 Here Co…         7.1        1515
 5      3      22 Beach G…         9.1        2783
 6      7       1 Nepotism         8.4        1897
 7      3      15 Phyllis…         8.3        2283
 8      9      21 Livin' …         8.9        2041
 9      9      18 Promos           8          1445
10      8      12 Pool Pa…         8          1612
# ℹ 140 more rows
# ℹ 1 more variable: air_date <date>
(office_test = testing(office_split))
# A tibble: 38 × 6
   season episode title    imdb_rating total_votes
    <dbl>   <dbl> <chr>          <dbl>       <dbl>
 1      1       2 Diversi…         8.3        3566
 2      2       4 The Fire         8.4        2713
 3      2       9 E-Mail …         8.4        2527
 4      2      12 The Inj…         9          3282
 5      2      22 Casino …         9.3        3644
 6      3       5 Initiat…         8.2        2254
 7      3      16 Busines…         8.8        2622
 8      3      17 Cocktai…         8.5        2264
 9      4       6 Branch …         8.5        2185
10      4       7 Survivo…         8.3        2110
# ℹ 28 more rows
# ℹ 1 more variable: air_date <date>

Feature engineering with dplyr

office_train |>
  mutate(
    season = as_factor(season),
    month = lubridate::month(air_date),
    wday = lubridate::wday(air_date),
    top10_votes = as.integer(total_votes > quantile(total_votes, 0.9))
  )
# A tibble: 150 × 9
   season episode title               imdb_rating total_votes air_date   month  wday top10_votes
   <fct>    <dbl> <chr>                     <dbl>       <dbl> <date>     <dbl> <dbl>       <int>
 1 8           18 Last Day in Florida         7.8        1429 2012-03-08     3     5           0
 2 9           14 Vandalism                   7.6        1402 2013-01-31     1     5           0
 3 2            8 Performance Review          8.2        2416 2005-11-15    11     3           0
 4 9            5 Here Comes Treble           7.1        1515 2012-10-25    10     5           0
 5 3           22 Beach Games                 9.1        2783 2007-05-10     5     5           0
 6 7            1 Nepotism                    8.4        1897 2010-09-23     9     5           0
 7 3           15 Phyllis' Wedding            8.3        2283 2007-02-08     2     5           0
 8 9           21 Livin' the Dream            8.9        2041 2013-05-02     5     5           0
 9 9           18 Promos                      8          1445 2013-04-04     4     5           0
10 8           12 Pool Party                  8          1612 2012-01-19     1     5           0
# ℹ 140 more rows

Anyone see a potential problem with the code above?
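One way to see the issue (a toy sketch with made-up vote counts): quantile() derives its cutoff from whatever data it is given, so applying the same mutate() to new data silently uses a different threshold.

```r
# Hypothetical vote counts: quantile() recomputes its threshold from
# whichever data it receives, so a feature like top10_votes would be
# defined inconsistently between training and test sets.
train_votes <- c(100, 200, 300, 400, 5000)
test_votes  <- c(100, 150, 200)

quantile(train_votes, 0.9)   # threshold learned from training data
quantile(test_votes, 0.9)    # a different threshold on new data
```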

Better living through recipes

r = recipe(imdb_rating ~ ., data = office_train)
summary(r)
# A tibble: 6 × 4
  variable    type      role      source  
  <chr>       <list>    <chr>     <chr>   
1 season      <chr [2]> predictor original
2 episode     <chr [2]> predictor original
3 title       <chr [3]> predictor original
4 total_votes <chr [2]> predictor original
5 air_date    <chr [1]> predictor original
6 imdb_rating <chr [2]> outcome   original

Recipe roles

r = recipe(
  imdb_rating ~ ., data = office_train
) |> 
  update_role(title, new_role = "ID")
summary(r)
# A tibble: 6 × 4
  variable    type      role      source  
  <chr>       <list>    <chr>     <chr>   
1 season      <chr [2]> predictor original
2 episode     <chr [2]> predictor original
3 title       <chr [3]> ID        original
4 total_votes <chr [2]> predictor original
5 air_date    <chr [1]> predictor original
6 imdb_rating <chr [2]> outcome   original

Adding features (month & day of week)

r = recipe(
  imdb_rating ~ ., data = office_train
) |> 
  update_role(title, new_role = "ID") |>
  step_date(air_date, features = c("dow", "month"))

Adding Holidays

r = recipe(
  imdb_rating ~ ., data = office_train
) |> 
  update_role(title, new_role = "ID") |>
  step_date(air_date, features = c("dow", "month")) |>
  step_holiday(
    air_date, 
    holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), 
    keep_original_cols = FALSE
  )

Seasons as factors

r = recipe(
  imdb_rating ~ ., data = office_train
) |> 
  update_role(title, new_role = "ID") |>
  step_date(air_date, features = c("dow", "month")) |>
  step_holiday(
    air_date, 
    holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), 
    keep_original_cols = FALSE
  ) |>
  step_num2factor(season, levels = as.character(1:9))

Dummy coding

r = recipe(
  imdb_rating ~ ., data = office_train
) |> 
  update_role(title, new_role = "ID") |>
  step_date(air_date, features = c("dow", "month")) |>
  step_holiday(
    air_date, 
    holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), 
    keep_original_cols = FALSE
  ) |>
  step_num2factor(season, levels = as.character(1:9)) |>
  step_dummy(all_nominal_predictors())

top10_votes

r = recipe(
  imdb_rating ~ ., data = office_train
) |> 
  update_role(title, new_role = "ID") |>
  step_date(air_date, features = c("dow", "month")) |>
  step_holiday(
    air_date, 
    holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), 
    keep_original_cols = FALSE
  ) |>
  step_num2factor(season, levels = as.character(1:9)) |>
  step_dummy(all_nominal_predictors()) |>
  step_percentile(total_votes) |>
  step_mutate(top10 = as.integer(total_votes >= 0.9)) |>
  step_rm(total_votes)

Preparing a recipe

prep(r)
── Recipe ──────────────────────────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:   1
predictor: 4
ID:        1
── Training information 
Training data contained 150 data points and no incomplete rows.
── Operations 
• Date features from: air_date | Trained
• Holiday features from: air_date | Trained
• Factor variables from: season | Trained
• Dummy variables from: season, air_date_dow, air_date_month | Trained
• Percentile transformation on: total_votes | Trained
• Variable mutation for: ~as.integer(total_votes >= 0.9) | Trained
• Variables removed: total_votes | Trained

Baking a recipe

prep(r) |>
  bake(new_data = office_train)
# A tibble: 150 × 33
   episode title    imdb_rating air_date_USThanksgiv…¹ air_date_USChristmas…² air_date_USNewYearsDay
     <dbl> <chr>          <dbl>                  <int>                  <int>                  <int>
 1      18 Last Da…         7.8                      0                      0                      0
 2      14 Vandali…         7.6                      0                      0                      0
 3       8 Perform…         8.2                      0                      0                      0
 4       5 Here Co…         7.1                      0                      0                      0
 5      22 Beach G…         9.1                      0                      0                      0
 6       1 Nepotism         8.4                      0                      0                      0
 7      15 Phyllis…         8.3                      0                      0                      0
 8      21 Livin' …         8.9                      0                      0                      0
 9      18 Promos           8                        0                      0                      0
10      12 Pool Pa…         8                        0                      0                      0
# ℹ 140 more rows
# ℹ abbreviated names: ¹​air_date_USThanksgivingDay, ²​air_date_USChristmasDay
# ℹ 27 more variables: air_date_USIndependenceDay <int>, season_X2 <dbl>, season_X3 <dbl>,
#   season_X4 <dbl>, season_X5 <dbl>, season_X6 <dbl>, season_X7 <dbl>, season_X8 <dbl>,
#   season_X9 <dbl>, air_date_dow_Mon <dbl>, air_date_dow_Tue <dbl>, air_date_dow_Wed <dbl>,
#   air_date_dow_Thu <dbl>, air_date_dow_Fri <dbl>, air_date_dow_Sat <dbl>,
#   air_date_month_Feb <dbl>, air_date_month_Mar <dbl>, air_date_month_Apr <dbl>, …

Informative features?

prep(r) |>
  bake(new_data = office_train) |>
  map_int(~ length(unique(.x)))
                   episode                      title                imdb_rating 
                        26                        150                         26 
air_date_USThanksgivingDay    air_date_USChristmasDay     air_date_USNewYearsDay 
                         1                          1                          1 
air_date_USIndependenceDay                  season_X2                  season_X3 
                         1                          2                          2 
                 season_X4                  season_X5                  season_X6 
                         2                          2                          2 
                 season_X7                  season_X8                  season_X9 
                         2                          2                          2 
          air_date_dow_Mon           air_date_dow_Tue           air_date_dow_Wed 
                         1                          2                          1 
          air_date_dow_Thu           air_date_dow_Fri           air_date_dow_Sat 
                         2                          1                          1 
        air_date_month_Feb         air_date_month_Mar         air_date_month_Apr 
                         2                          2                          2 
        air_date_month_May         air_date_month_Jun         air_date_month_Jul 
                         2                          1                          1 
        air_date_month_Aug         air_date_month_Sep         air_date_month_Oct 
                         1                          2                          2 
        air_date_month_Nov         air_date_month_Dec                      top10 
                         2                          2                          2 

Removing zero variance predictors

r = recipe(
    imdb_rating ~ ., data = office_train
  ) |> 
  update_role(title, new_role = "ID") |>
  step_date(air_date, features = c("dow", "month")) |>
  step_holiday(
    air_date, 
    holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), 
    keep_original_cols = FALSE
  ) |>
  step_num2factor(season, levels = as.character(1:9)) |>
  step_dummy(all_nominal_predictors()) |>
  step_percentile(total_votes) |>
  step_mutate(top10 = as.integer(total_votes >= 0.9)) |>
  step_rm(total_votes) |>
  step_zv(all_predictors())

prep(r) |>
  bake(new_data = office_train)
# A tibble: 150 × 22
   episode title   imdb_rating season_X2 season_X3 season_X4 season_X5 season_X6 season_X7 season_X8
     <dbl> <chr>         <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1      18 Last D…         7.8         0         0         0         0         0         0         1
 2      14 Vandal…         7.6         0         0         0         0         0         0         0
 3       8 Perfor…         8.2         1         0         0         0         0         0         0
 4       5 Here C…         7.1         0         0         0         0         0         0         0
 5      22 Beach …         9.1         0         1         0         0         0         0         0
 6       1 Nepoti…         8.4         0         0         0         0         0         1         0
 7      15 Phylli…         8.3         0         1         0         0         0         0         0
 8      21 Livin'…         8.9         0         0         0         0         0         0         0
 9      18 Promos          8           0         0         0         0         0         0         0
10      12 Pool P…         8           0         0         0         0         0         0         1
# ℹ 140 more rows
# ℹ 12 more variables: season_X9 <dbl>, air_date_dow_Tue <dbl>, air_date_dow_Thu <dbl>,
#   air_date_month_Feb <dbl>, air_date_month_Mar <dbl>, air_date_month_Apr <dbl>,
#   air_date_month_May <dbl>, air_date_month_Sep <dbl>, air_date_month_Oct <dbl>,
#   air_date_month_Nov <dbl>, air_date_month_Dec <dbl>, top10 <int>

Really putting it all together

(office_work = workflow() |>
  add_recipe(r) |>
  add_model(
    linear_reg() |>
    set_engine("lm")
  )
)
══ Workflow ════════════════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────────────────────────
8 Recipe Steps

• step_date()
• step_holiday()
• step_num2factor()
• step_dummy()
• step_percentile()
• step_mutate()
• step_rm()
• step_zv()

── Model ───────────────────────────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)

Computational engine: lm 

Workflow fit

(office_fit = office_work |>
  fit(data = office_train))
══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()

── Preprocessor ────────────────────────────────────────────────────────────────────────────────────
8 Recipe Steps

• step_date()
• step_holiday()
• step_num2factor()
• step_dummy()
• step_percentile()
• step_mutate()
• step_rm()
• step_zv()

── Model ───────────────────────────────────────────────────────────────────────────────────────────

Call:
stats::lm(formula = ..y ~ ., data = data)

Coefficients:
       (Intercept)             episode           season_X2           season_X3           season_X4  
          7.342550            0.007041            1.216204            1.411946            1.433737  
         season_X5           season_X6           season_X7           season_X8           season_X9  
          1.353140            1.150998            1.218278            0.526168            0.789066  
  air_date_dow_Tue    air_date_dow_Thu  air_date_month_Feb  air_date_month_Mar  air_date_month_Apr  
         -0.286676           -0.362900           -0.077280           -0.048725            0.015377  
air_date_month_May  air_date_month_Sep  air_date_month_Oct  air_date_month_Nov  air_date_month_Dec  
          0.144080            0.092809           -0.047704           -0.077198            0.306999  
             top10  
          0.890058  

Performance

office_fit |>
  augment(office_train) |>
  rmse(imdb_rating, .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.342
office_fit |>
  augment(office_test) |>
  rmse(imdb_rating, .pred)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard       0.399
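The train/test comparison above can also be done in one step with tune::last_fit(), which fits a workflow on the training portion of a split and computes metrics on the held-out portion (a sketch using the objects defined above; rmse and rsq are the default regression metrics):

```r
# Fit on office_train, evaluate on office_test, in a single call
office_last = office_work |>
  last_fit(office_split)

collect_metrics(office_last)
```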

k-fold cross-validation

Creating folds

set.seed(123)
(folds = vfold_cv(office_train, v=5))
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits           id   
  <list>           <chr>
1 <split [120/30]> Fold1
2 <split [120/30]> Fold2
3 <split [120/30]> Fold3
4 <split [120/30]> Fold4
5 <split [120/30]> Fold5
(office_fit_folds = office_work |>
  fit_resamples(folds)
)
→ A | warning: prediction from rank-deficient fit; consider predict(., rankdeficient="NA")
There were issues with some computations   A: x1
# Resampling results
# 5-fold cross-validation 
# A tibble: 5 × 4
  splits           id    .metrics         .notes          
  <list>           <chr> <list>           <list>          
1 <split [120/30]> Fold1 <tibble [2 × 4]> <tibble [0 × 4]>
2 <split [120/30]> Fold2 <tibble [2 × 4]> <tibble [1 × 4]>
3 <split [120/30]> Fold3 <tibble [2 × 4]> <tibble [0 × 4]>
4 <split [120/30]> Fold4 <tibble [2 × 4]> <tibble [0 × 4]>
5 <split [120/30]> Fold5 <tibble [2 × 4]> <tibble [0 × 4]>

There were issues with some computations:

  - Warning(s) x1: prediction from rank-deficient fit; consider predict...

Run `show_notes(.Last.tune.result)` for more information.

Fold performance

tune::collect_metrics(office_fit_folds)
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config        
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
1 rmse    standard   0.420     5  0.0182 pre0_mod0_post0
2 rsq     standard   0.429     5  0.0597 pre0_mod0_post0
tune::collect_metrics(office_fit_folds, summarize = FALSE) |>
  filter(.metric == "rmse")
# A tibble: 5 × 5
  id    .metric .estimator .estimate .config        
  <chr> <chr>   <chr>          <dbl> <chr>          
1 Fold1 rmse    standard       0.467 pre0_mod0_post0
2 Fold2 rmse    standard       0.403 pre0_mod0_post0
3 Fold3 rmse    standard       0.405 pre0_mod0_post0
4 Fold4 rmse    standard       0.454 pre0_mod0_post0
5 Fold5 rmse    standard       0.368 pre0_mod0_post0