Welcome & Syllabus

Lecture 01

Dr. Colin Rundel

Course Details

Course Team

Instructor


TAs

  • Sam Rosen
  • Lynn Kremers

Course website(s)

Labs

  • Attendance is expected

  • Opportunity to work on course assignments with TA support

  • Labs will begin this week

Assessment

This course is graded 100% on your coursework (there are no exams).

We will be assessing you based on the following assignments:


Assignment   Type         Value   n     Assigned
Homeworks    Team         30%     5/6   ~ Every other week
Midterms     Individual   40%     2     ~ Week 6 and 14
Project      Team         10%     1     ~ Week 10
Quizzes      Individual   20%     ~15

Teams

  • Team assignments
    • Roughly biweekly homework assignments
    • Open ended, ~5 - 15 hours of work
    • Peer evaluation after completion
  • Expectations and roles:
    • Everyone is expected to contribute equal effort
    • Everyone is expected to understand all code turned in
    • Individual contribution evaluated by peer evaluation, commits, etc.

Collaboration policy

  • Only work that is clearly assigned as team work should be completed collaboratively (Homeworks + Project).

  • Individual assignments (Midterms) must be completed individually. You may not directly share or discuss answers / code with anyone other than myself and the TAs.

  • On Homeworks you should not directly share answers / code with other teams; however, you are welcome to discuss the problems in general and ask for advice.

Sharing / reusing code / AI policy

  • We are aware that a huge volume of code is available on the web, and many tasks may have solutions posted.

  • Unless explicitly stated otherwise, this course’s policy is that you may make use of any online resources (e.g. Google, StackOverflow, etc.) but you must explicitly cite where you obtained any code you directly use or use as inspiration in your solution(s).

  • Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism, regardless of source.

  • The same applies to the use of LLMs like ChatGPT, Claude, or GitHub Copilot - you are welcome to make use of these tools as the basis for your solutions but you must cite the tool when using it for significant amounts of code generation.

Brief thoughts on AI tools

  • AI tools are not a replacement for understanding the material, but they can be a tool to help you understand it.

  • Reading code and writing code are both skills that take time and practice to develop - both are still essential skills.

  • Nature of the tools is changing rapidly - Autocomplete vs ChatBots vs Agentic

Academic integrity


To uphold the Duke Community Standard:

  • I will not lie, cheat, or steal in my academic endeavors;
  • I will conduct myself honorably in all my endeavors; and
  • I will act if the Standard is compromised.

Course Tools

Accessing RStudio Workbench

To reduce friction, the preferred method is to use the department’s RStudio server(s).

To access RStudio/Posit Workbench:

  1. Navigate to https://rstudio.stat.duke.edu
  2. Log-in with your Duke NetID and password.

DSS RStudio alternatives

If you cannot access RStudio via the DSS servers:

  • Make sure you are on an authenticated Duke network (e.g. DukeBlue or VPN if off campus)

  • Make sure you are not using a custom DNS server

    • e.g. 1.1.1.1 or 8.8.8.8
  • Use a Docker container from Duke OIT

    1. Go to https://cmgr.oit.duke.edu/ and login
    2. Select Reserve a Container and find a container for Sta 313
    3. Click the link under my reservations to create your environment

Local R + RStudio

If working locally you should make sure that your environment meets the following requirements:

  • latest R (4.5.1)

  • latest RStudio (2025.05.1+513)

  • working git installation

  • ability to create ssh keys (for GitHub authentication)

  • All R packages updated to their latest version from CRAN
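
A rough way to check a local setup against the list above from the R console is sketched below. The variable names are hypothetical, and looking for `ssh-keygen` on the PATH is only an assumed stand-in for "ability to create ssh keys"; treat this as a quick sanity check rather than an official course script.

```r
# Report the running R version and compare it with the version listed above
R.version.string
r_ok = getRversion() >= "4.5.1"

# Sys.which() returns "" when an executable cannot be found on the PATH
git_ok = nzchar(unname(Sys.which("git")))
ssh_ok = nzchar(unname(Sys.which("ssh-keygen")))  # assumed proxy for ssh key support

checks = c(R = r_ok, git = git_ok, ssh_keygen = ssh_ok)
checks

# To bring installed CRAN packages up to date (this can take a while):
# update.packages(ask = FALSE)
```

Any FALSE entry in `checks` points at the item to fix before relying on a local install.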

Support policy for local installs - we will try to help you troubleshoot if we can but reserve the right to tell you to use the dept server.

GitHub

  • We will be using an organization specific to this course: github.com/sta523-sp25

  • All assignments will be distributed and collected via GitHub

  • All of your work and your membership (enrollment) in the organization is private

  • We will be distributing a survey this weekend to collect your GitHub account names

    • Before lab you will be invited to the course organization.
  • All course related repositories will be created for you

Before Friday

In R (almost) everything is a vector

Vectors

The fundamental building blocks of data in R are vectors (collections of related values, objects, etc).


R has two types of vectors (that everything is built on):

  • atomic vectors (vectors)

    • homogeneous collections of the same type (e.g. all true/false values, all numbers, or all character strings).
  • generic vectors (lists)

    • heterogeneous collections of any type of R object, even other lists (meaning they can have a hierarchical/tree-like structure).
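
As a minimal sketch of the distinction above, the snippet below builds one atomic vector and one list. The object and element names (`nums`, `record`, `label`, and so on) are made up for illustration.

```r
# An atomic vector holds values of a single type
nums = c(1.5, 2, 3)
typeof(nums)              # "double"

# A list (generic vector) can mix types and contain other lists
record = list(
  label  = "example",
  values = 1:3,
  nested = list(flag = TRUE, size = 4L)
)
typeof(record)              # "list"
typeof(record$label)        # "character"
typeof(record$values)       # "integer"
typeof(record$nested$size)  # "integer"
length(record)              # 3 (only top-level elements are counted)
```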

Atomic Vectors

R has six atomic vector types. We can check the type of any object in R using the typeof() function; mode() is a higher-level abstraction used to group similar types together.

typeof()     mode()
logical      logical
double       numeric
integer      numeric
character    character
complex      complex
raw          raw
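
The complex and raw types appear less often in practice, so here is a brief sketch of what they look like; the variable names are placeholders.

```r
# complex: numbers with an imaginary part, written with an i suffix
z = 1 + 2i
typeof(z)  # "complex"
mode(z)    # "complex"

# raw: individual bytes, printed in hexadecimal
r = as.raw(c(1, 255))
r          # [1] 01 ff
typeof(r)  # "raw"
mode(r)    # "raw"
```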

logical - boolean values (TRUE and FALSE)

typeof(TRUE)
[1] "logical"
typeof(FALSE)
[1] "logical"
mode(TRUE)
[1] "logical"
mode(FALSE)
[1] "logical"


R will let you use T and F as shortcuts for TRUE and FALSE. This is bad practice, as these values are actually global variables that can be overwritten.

T
[1] TRUE
T = "FALSE"
T
[1] "FALSE"

character - text strings

Either single or double quotes are fine, but the opening and closing quotes must match.

typeof("hello")
[1] "character"
typeof('world')
[1] "character"
mode("hello")
[1] "character"
mode('world')
[1] "character"

Quote characters can be included by escaping or using a non-matching quote.

"abc'123"
[1] "abc'123"
'abc"123'
[1] "abc\"123"
"abc\"123"
[1] "abc\"123"
'abc\'123'
[1] "abc'123"

Numeric types

double - floating point values (these are the default numerical type)

typeof(1.33)
[1] "double"
typeof(7)
[1] "double"
mode(1.33)
[1] "numeric"
mode(7)
[1] "numeric"

integer - integer values (literals are indicated by an L suffix)

typeof( 7L )
[1] "integer"
typeof( 1:3 )
[1] "integer"
mode( 7L )
[1] "numeric"
mode( 1:3 )
[1] "numeric"

Combining / Concatenation

Atomic vectors can be constructed using the combine c() function.

c(1, 2, 3)
[1] 1 2 3
c("Hello", "World!")
[1] "Hello"  "World!"
c(1, 1:10)
 [1]  1  1  2  3  4  5  6  7  8  9 10
c(1,c(2, c(3)))
[1] 1 2 3

Inspecting types

  • typeof(x) - returns a character vector (length 1) of the type of object x.

  • mode(x) - returns a character vector (length 1) of the mode of object x.

typeof(1)
[1] "double"
typeof(1L)
[1] "integer"
typeof("A")
[1] "character"
typeof(TRUE)
[1] "logical"
mode(1)
[1] "numeric"
mode(1L)
[1] "numeric"
mode("A")
[1] "character"
mode(TRUE)
[1] "logical"

Type predicates

  • is.logical(x) - returns TRUE if x has type logical.
  • is.character(x) - returns TRUE if x has type character.
  • is.double(x) - returns TRUE if x has type double.
  • is.integer(x) - returns TRUE if x has type integer.
  • is.numeric(x) - returns TRUE if x has mode numeric.

is.integer(1)
[1] FALSE
is.integer(1L)
[1] TRUE
is.integer(3:7)
[1] TRUE
is.double(1)
[1] TRUE
is.double(1L)
[1] FALSE
is.double(3:8)
[1] FALSE
is.numeric(1)
[1] TRUE
is.numeric(1L)
[1] TRUE
is.numeric(3:7)
[1] TRUE

Other useful predicates

  • is.atomic(x) - returns TRUE if x is an atomic vector.
  • is.list(x) - returns TRUE if x is a list (generic vector).
  • is.vector(x) - returns TRUE if x is either an atomic or generic vector.

is.atomic(c(1,2,3))
[1] TRUE
is.list(c(1,2,3))
[1] FALSE
is.vector(c(1,2,3))
[1] TRUE
is.atomic(list(1,2,3))
[1] FALSE
is.list(list(1,2,3))
[1] TRUE
is.vector(list(1,2,3))
[1] TRUE

Type Coercion

R is a dynamically typed language – it will automatically convert between most types without raising warnings or errors. Keep in mind that atomic vectors must always contain values of the same type.

c(1, "Hello")
[1] "1"     "Hello"
c(FALSE, 3L)
[1] 0 3
c(1.2, 3L)
[1] 1.2 3.0
c(FALSE, "Hello")
[1] "FALSE" "Hello"

Operator coercion

Builtin operators and functions (e.g. +, &, log(), etc.) will generally attempt to coerce values to an appropriate type for the given operation (numeric for math, logical for logical, etc.)

3.1+1L
[1] 4.1
5 + FALSE
[1] 5
log(1)
[1] 0
log(TRUE)
[1] 0
TRUE & FALSE
[1] FALSE
TRUE & 7
[1] TRUE
TRUE | FALSE
[1] TRUE
FALSE | !5
[1] FALSE

Explicit Coercion

Most of the is functions we just saw have an as variant which can be used for explicit coercion.

as.logical(5.2)
[1] TRUE
as.character(TRUE)
[1] "TRUE"
as.integer(pi)
[1] 3
as.numeric(FALSE)
[1] 0
as.double("7.2")
[1] 7.2
as.double("one")
Warning: NAs introduced by coercion
[1] NA
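
A few additional cases, offered as a sketch rather than a complete catalogue, show that explicit coercion follows specific rules that are worth checking before relying on them.

```r
# as.integer() drops the fractional part (truncation toward zero), it does not round
as.integer(2.9)     # 2
as.integer(-2.9)    # -2

# as.logical() treats any non-zero number as TRUE
as.logical(0)       # FALSE
as.logical(-3)      # TRUE

# For strings, only a small set of spellings is recognised; others become NA
as.logical("true")  # TRUE
as.logical("T")     # TRUE
as.logical("yes")   # NA

# Numeric strings may use scientific notation
as.numeric("1e3")   # 1000
```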

Missing Values

R uses NA to represent missing values in its data structures. What may not be obvious is that there are different NAs for the different atomic types.

typeof(NA)
[1] "logical"
typeof(NA+1)
[1] "double"
typeof(NA+1L)
[1] "integer"
typeof(c(NA,""))
[1] "character"
typeof(NA_character_)
[1] "character"
typeof(NA_real_)
[1] "double"
typeof(NA_integer_)
[1] "integer"
typeof(NA_complex_)
[1] "complex"

NA stickiness

Because NAs represent missing values it makes sense that most calculations using them will also be missing.

1 + NA
[1] NA
1 / NA
[1] NA
NA * 5
[1] NA
sqrt(NA)
[1] NA
3^NA
[1] NA
sum(c(1, 2, 3, NA))
[1] NA

Aggregation / summarization functions (e.g. sum(), mean(), sd(), etc.) will often have a na.rm argument which drops the missing values from the calculation.

sum(c(1, 2, 3, NA), na.rm = TRUE)
[1] 6
mean(c(1, 2, 3, NA), na.rm = TRUE)
[1] 2

NAs are not always sticky

A useful mental model for NAs is to consider them as an unknown value that could take any of the possible values for a type.

For numbers or characters this isn’t very helpful, but for a logical value we know that the value must either be TRUE or FALSE and we can use that when deciding what value to return.

TRUE & NA
[1] NA
FALSE & NA
[1] FALSE
TRUE | NA
[1] TRUE
FALSE | NA
[1] NA
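
The mental model above can be sketched in code: replace the NA with each value it could stand for and see whether the answers agree. The helper `possible()` is hypothetical and written only for this illustration.

```r
# Hypothetical helper: apply a logical operator to a known value and to each
# value the unknown (NA) side could take
possible = function(op, known) {
  c(if_true = op(known, TRUE), if_false = op(known, FALSE))
}

possible(`&`, TRUE)   # TRUE  FALSE -> answers differ, so TRUE & NA is NA
possible(`&`, FALSE)  # FALSE FALSE -> answers agree, so FALSE & NA is FALSE
possible(`|`, TRUE)   # TRUE  TRUE  -> answers agree, so TRUE | NA is TRUE
possible(`|`, FALSE)  # TRUE  FALSE -> answers differ, so FALSE | NA is NA
```

When both substitutions give the same answer, the unknown value cannot affect the result, which matches the cases where R returns a non-missing value.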

Other Special values (double)

These are defined as part of the IEEE floating point standard (not unique to R)

  • NaN - Not a number

  • Inf - Positive infinity

  • -Inf - Negative infinity

pi / 0
[1] Inf
0 / 0
[1] NaN
1/0 + 1/0
[1] Inf
Inf - Inf
[1] NaN
NaN / NA
[1] NA
NaN * NA
[1] NA

Testing for Inf and NaN

There are convenience functions for testing for NaN and Inf values.

is.finite(Inf)
[1] FALSE
is.infinite(-Inf)
[1] TRUE
is.nan(Inf)
[1] FALSE
Inf > 1
[1] TRUE
is.finite(NaN)
[1] FALSE
is.infinite(NaN)
[1] FALSE
is.nan(NaN)
[1] TRUE
-Inf > 1
[1] FALSE
is.finite(NA)
[1] FALSE
is.infinite(NA)
[1] FALSE
is.nan(NA)
[1] FALSE
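
These predicates are vectorized, so a single call can classify every element of a vector. The sketch below also includes is.na(), which is not used in the examples above, to show that it treats NaN as missing while is.nan() does not treat NA as NaN.

```r
x = c(1, NA, NaN, Inf, -Inf)

is.na(x)        # FALSE  TRUE  TRUE FALSE FALSE  (NaN also counts as missing)
is.nan(x)       # FALSE FALSE  TRUE FALSE FALSE
is.finite(x)    #  TRUE FALSE FALSE FALSE FALSE
is.infinite(x)  # FALSE FALSE FALSE  TRUE  TRUE
```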

Coercion for infinity and NaN

First, remember that Inf, -Inf, and NaN are doubles; however, their coercion behavior is not the same as that of other doubles.

as.integer(Inf)
Warning: NAs introduced by coercion to integer range
[1] NA
as.integer(NaN)
[1] NA
as.logical(Inf)
[1] TRUE
as.logical(-Inf)
[1] TRUE
as.logical(NaN)
[1] NA
as.character(Inf)
[1] "Inf"
as.character(-Inf)
[1] "-Inf"
as.character(NaN)
[1] "NaN"

Exercise 1

Part 1

What is the type of the following vectors? Explain why they have that type.

c(1, NA+1L, "C")
c(1L / 0, NA)
c(1:3, 5)
c(3L, NaN+1L)
c(NA, TRUE)

Part 2

Considering only the four (common) data types, what is R’s implicit type conversion hierarchy (from highest priority to lowest priority)?

Logical & Comparison operators

Logical (boolean) operators


Operator     Operation       Vectorized?
x | y        or              Yes
x & y        and             Yes
!x           not             Yes
x || y       or              No
x && y       and             No
xor(x, y)    exclusive or    Yes

Vectorized?

x = c(TRUE,FALSE,TRUE)
y = c(FALSE,TRUE,TRUE)
x | y
[1] TRUE TRUE TRUE
x & y
[1] FALSE FALSE  TRUE
x || y
Error in x || y: 'length = 3' in coercion to 'logical(1)'
x && y
Error in x && y: 'length = 3' in coercion to 'logical(1)'
TRUE && FALSE
[1] FALSE

& and | are almost always going to be the right choice. The only time you should use && or || is when you need to take advantage of short-circuit evaluation.
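
Short-circuit evaluation means the right-hand side is only evaluated when the left-hand side does not already decide the result. A minimal sketch, using a hypothetical function `boom()` that raises an error if it is ever called:

```r
# Hypothetical helper that fails loudly if it is ever evaluated
boom = function() stop("right-hand side was evaluated")

FALSE && boom()  # FALSE, boom() never runs
TRUE  || boom()  # TRUE, boom() never runs

# The vectorized operators always evaluate both sides, so boom() is called
res = tryCatch(FALSE & boom(), error = function(e) "error")
res              # "error"

# A common use: guard a check that only makes sense for some inputs
x = "a"
is.numeric(x) && x > 0  # FALSE, the comparison is skipped
```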

Vectorization and math

Almost all of the basic mathematical operations (and many other functions) in R are vectorized.

c(1, 2, 3) + c(3, 2, 1)
[1] 4 4 4
c(1, 2, 3) / c(3, 2, 1)
[1] 0.3333333 1.0000000 3.0000000
log(c(1, 3, 0))
[1] 0.000000 1.098612     -Inf
sin(c(1, 2, 3))
[1] 0.8414710 0.9092974 0.1411200
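
To make "vectorized" concrete, the sketch below writes out the element-by-element loop that a single vectorized call performs for you. The variable names are placeholders.

```r
x = c(1, 2, 3)
y = c(3, 2, 1)

# Explicit loop: compute one element at a time
out = numeric(length(x))
for (i in seq_along(x)) {
  out[i] = x[i] + y[i]
}
out                    # 4 4 4

# Vectorized: the same element-wise result in one expression
x + y                  # 4 4 4
identical(out, x + y)  # TRUE
```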

Length coercion (aka recycling)

If the lengths of the vectors do not match, then the shorter vector has its values recycled to match the length of the longer vector.

x = c(TRUE, FALSE, TRUE)
y = c(TRUE)
z = c(FALSE, TRUE)
x | y
[1] TRUE TRUE TRUE
x & y
[1]  TRUE FALSE  TRUE
y | z
[1] TRUE TRUE
y & z
[1] FALSE  TRUE
x | z
Warning in x | z: longer object length is not a multiple of shorter object
length
[1] TRUE TRUE TRUE

Length coercion and math

The same length coercion rules apply for most basic mathematical operators.

x = c(1, 2, 3)
y = c(5, 4)
z = 10L
x + x
[1] 2 4 6
x + z
[1] 11 12 13
y / z
[1] 0.5 0.4
log(x)+z
[1] 10.00000 10.69315 11.09861
x %% y
Warning in x%%y: longer object length is not a multiple of shorter object
length
[1] 1 2 3
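
One way to picture recycling is to expand the shorter vector by hand with rep_len() before applying the operator. This is only an illustration of the result, not a claim about how R implements it internally.

```r
x = c(1, 2, 3, 4)
y = c(10, 20)

# rep_len() repeats y until it reaches the requested length
y_full = rep_len(y, length(x))
y_full      # 10 20 10 20

x + y       # 11 22 13 24
x + y_full  # 11 22 13 24

# When the lengths are not multiples, the last repetition is cut short,
# which is the situation R warns about
rep_len(c(5, 4), 3)  # 5 4 5
```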

Comparison operators


Operator    Comparison                  Vectorized?
x < y       less than                   Yes
x > y       greater than                Yes
x <= y      less than or equal to       Yes
x >= y      greater than or equal to    Yes
x != y      not equal to                Yes
x == y      equal to                    Yes
x %in% y    contains                    Yes (over x)

Comparisons

x = c("A","B","C")
y = c("A")
x == y
[1]  TRUE FALSE FALSE
x != y
[1] FALSE  TRUE  TRUE
x %in% y
[1]  TRUE FALSE FALSE
y %in% x
[1] TRUE
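
The difference between == and %in% becomes clearer when y has more than one element. In this sketch, == compares position by position (recycling y), while %in% asks whether each element of x appears anywhere in y.

```r
x = c("A", "B", "C", "D")
y = c("B", "A")

# y is recycled to "B" "A" "B" "A" and compared element by element
x == y    # FALSE FALSE FALSE FALSE

# Each element of x is checked for membership in y
x %in% y  # TRUE TRUE FALSE FALSE
```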

Type coercion also applies to comparison operators, which can result in interesting behavior.

TRUE == "TRUE"
[1] TRUE
FALSE == 1
[1] FALSE
TRUE == 1
[1] TRUE
TRUE == 5
[1] FALSE

> & < with characters

While maybe somewhat unexpected, these comparison operators can be used with character values.

"A" < "B"
[1] TRUE
"A" > "B"
[1] FALSE
"A" < "a"
[1] FALSE
"a" > "!"
[1] TRUE
"Good" < "Goodbye"
[1] TRUE
c("Alice", "Bob", "Carol") <= "B"
[1]  TRUE FALSE FALSE