Tidyverse Exam v2.0

Basic operations

Question 1
Question 2
Question 3
Question 4

Cleaning and counting

Question 1
Question 2
Question 3
Question 4
Question 5

Combining data

Question 1
Question 2
Question 3
Question 4

Plotting

Question 1
Question 2
Question 3
Question 4
Question 5

Functional programming

Question 1
Question 2
Question 3
Question 4

Wrapping up

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Basic operations

more-example-exams/#basic-operations

Question 1

Read the file person.csv and store the result in a tibble called person.

person <- read_csv("https://education.rstudio.com/blog/2020/08/more-example-exams/person.csv")

## Parsed with column specification:
## cols(
##   person_id = col_character(),
##   personal_name = col_character(),
##   family_name = col_character()
## )

class(person)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Question 2

Create a tibble containing only family and personal names, in that order. You do not need to assign this tibble or any others to variables unless explicitly asked to do so. However, as noted in the introduction, you must use the pipe operator %>% and code that follows the tidyverse style guide.

# View(person)

person %>%
  select(family_name, personal_name)

family_name <chr>	personal_name <chr>
Dyer	William
Pabodie	Frank
Lake	Anderson
Roerich	Valentina
Danforth	Frank

Question 3

Create a new tibble containing only the rows in which family names come before the letter M. Your solution should work for tables with more rows than the example, i.e., you cannot rely on row numbers or select specific names.

person %>%
  arrange(family_name) %>%
  filter(family_name < "M")

person_id <chr>	personal_name <chr>	family_name <chr>
danforth	Frank	Danforth
dyer	William	Dyer
lake	Anderson	Lake

Question 4

Display all the rows in person sorted by family name length with the longest name first.

person %>%
  arrange(desc(str_length(family_name)))

person_id <chr>	personal_name <chr>	family_name <chr>
danforth	Frank	Danforth
pb	Frank	Pabodie
roe	Valentina	Roerich
dyer	William	Dyer
lake	Anderson	Lake

Cleaning and counting

more-sample-exams/#cleaning-and-counting

Question 1

Read the file measurements.csv to create a tibble called measurements. (The strings “rad”, “sal”, and “temp” in the quantity column stand for “radiation”, “salinity”, and “temperature” respectively.)

measurements <- read_csv("https://education.rstudio.com/blog/2020/08/more-example-exams/measurements.csv")

## Parsed with column specification:
## cols(
##   visit_id = col_double(),
##   visitor = col_character(),
##   quantity = col_character(),
##   reading = col_double()
## )

Question 2

Create a tibble containing only rows where none of the values are NA and save in a tibble called cleaned.

cleaned <- measurements %>%
  filter(!is.na(visit_id), !is.na(visitor), !is.na(quantity), !is.na(reading))

# other option: use na.omit(measurements)
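
A shorter way to drop incomplete rows is tidyr's drop_na(), which checks every column when called without arguments. A minimal sketch on made-up data (the toy tibble only imitates the shape of measurements.csv):

```r
library(tidyverse)

# Hypothetical toy data standing in for measurements.csv
toy <- tibble(
  visit_id = c(1, 2, 3),
  visitor  = c("dyer", NA, "lake"),
  quantity = c("rad", "sal", "sal"),
  reading  = c(9.82, 0.06, NA)
)

# With no column arguments, drop_na() removes rows with an NA in any column
toy_cleaned <- toy %>%
  drop_na()
```

Only the first row survives here, because the other two each contain one NA.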

Question 3

Count the number of measurements of each type of quantity in cleaned. Your result should have one row for each quantity "rad", "sal", and "temp".

cleaned %>%
  group_by(quantity) %>%
  summarize(n())

## `summarise()` ungrouping output (override with `.groups` argument)

quantity <chr>	n() <int>
rad	8
sal	7
temp	3

# other option: use count()
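
The count() alternative mentioned above can be sketched as follows. The toy tibble is made up for illustration; count() groups, tallies, and ungroups in one step, and names the result column n.

```r
library(tidyverse)

# Hypothetical toy data: only the quantity column matters for counting
toy_cleaned <- tibble(quantity = c("rad", "sal", "rad", "temp", "rad"))

# Equivalent to group_by(quantity) %>% summarize(n = n())
counts <- toy_cleaned %>%
  count(quantity)
```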

Question 4

Display the minimum and maximum value of reading separately for each quantity in cleaned. Your result should have one row for each quantity "rad", "sal", and "temp".

cleaned %>%
  group_by(quantity) %>%
  summarize(min(reading), max(reading))

## `summarise()` ungrouping output (override with `.groups` argument)

quantity <chr>	min(reading) <dbl>	max(reading) <dbl>
rad	1.46	11.25
sal	0.05	41.60
temp	-21.50	-16.00

Question 5

Create a tibble in which all salinity ("sal") readings greater than 1 are divided by 100. (This is needed because some people wrote percentages as numbers from 0.0 to 1.0, but others wrote them as 0.0 to 100.0.)

measurements %>%
  filter(quantity == "sal") %>%
  mutate(new_reading = ifelse(reading > 1, reading / 100, reading))

visit_id <dbl>	visitor <chr>	quantity <chr>	reading <dbl>	new_reading <dbl>
619	dyer	sal	0.13	0.130
622	dyer	sal	0.09	0.090
734	lake	sal	0.05	0.050
735	NA	sal	0.06	0.060
751	lake	sal	NA	NA
752	lake	sal	0.09	0.090
752	roe	sal	41.60	0.416
837	lake	sal	0.21	0.210
837	roe	sal	22.50	0.225

measurements %>%
  filter(quantity == "sal") %>%
  mutate(reading = ifelse(reading > 1, reading / 100, reading))

visit_id <dbl>	visitor <chr>	quantity <chr>	reading <dbl>
619	dyer	sal	0.130
622	dyer	sal	0.090
734	lake	sal	0.050
735	NA	sal	0.060
751	lake	sal	NA
752	lake	sal	0.090
752	roe	sal	0.416
837	lake	sal	0.210
837	roe	sal	0.225
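
If the non-salinity rows should stay in the result, the condition can test the quantity as well, so no filter() is needed. A minimal sketch on made-up data, using dplyr's type-strict if_else():

```r
library(tidyverse)

# Hypothetical toy data with one non-salinity row and one missing reading
toy <- tibble(
  quantity = c("rad", "sal", "sal", "sal"),
  reading  = c(9.82, 0.13, 41.6, NA)
)

# Rescale only salinity readings above 1; every other row is left unchanged
fixed <- toy %>%
  mutate(
    reading = if_else(quantity == "sal" & reading > 1, reading / 100, reading)
  )
```

The radiation reading of 9.82 is untouched even though it is greater than 1, and the missing reading stays NA.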

Combining data

more-sample-exams/#combining-data

Question 1

Read visited.csv and drop rows containing any NAs, assigning the result to a new tibble called visited.

visited <-
  read_csv("https://education.rstudio.com/blog/2020/08/more-example-exams/visited.csv") %>%
  filter(!is.na(site_id), !is.na(visit_date))

## Parsed with column specification:
## cols(
##   visit_id = col_double(),
##   site_id = col_character(),
##   visit_date = col_date(format = "")
## )

Question 2

Use an inner join to combine visited with cleaned using the visit_id column for matches.

inner_join(visited, cleaned, by = "visit_id")

visit_id <dbl>	site_id <chr>	visit_date <date>	visitor <chr>	quantity <chr>	reading <dbl>
619	DR-1	1927-02-08	dyer	rad	9.82
619	DR-1	1927-02-08	dyer	sal	0.13
622	DR-1	1927-02-10	dyer	rad	7.80
622	DR-1	1927-02-10	dyer	sal	0.09
734	DR-3	1930-01-07	pb	rad	8.41
734	DR-3	1930-01-07	lake	sal	0.05
734	DR-3	1930-01-07	pb	temp	-21.50
735	DR-3	1930-01-12	pb	rad	7.22
751	DR-3	1930-02-26	pb	rad	4.35
751	DR-3	1930-02-26	pb	temp	-18.50

Question 3

Find the highest radiation ("rad") reading at each site. (Sites are identified by values in the site_id column.)

inner_join(visited, cleaned, by = "visit_id") %>%
  filter(quantity == "rad") %>%
  group_by(site_id) %>%
  summarize(max(reading))

## `summarise()` ungrouping output (override with `.groups` argument)

site_id <chr>	max(reading) <dbl>
DR-1	11.25
DR-3	8.41
MSK-4	1.46

Question 4

Find the date of the highest radiation reading at each site.

inner_join(visited, cleaned, by = "visit_id") %>%
  filter(quantity == "rad") %>%
  group_by(site_id) %>%
  filter(reading == max(reading))

visit_id <dbl>	site_id <chr>	visit_date <date>	visitor <chr>	quantity <chr>	reading <dbl>
734	DR-3	1930-01-07	pb	rad	8.41
837	MSK-4	1932-01-14	lake	rad	1.46
844	DR-1	1932-03-22	roe	rad	11.25
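
Since the question asks only for the date, another option is slice_max() (available from dplyr 1.0.0), followed by select() to keep the relevant columns. A sketch on made-up data, where the site names and readings are illustrative only:

```r
library(tidyverse)

# Hypothetical joined data: two sites, one with a large non-radiation reading
toy_joined <- tibble(
  site_id    = c("A", "A", "B", "B"),
  visit_date = as.Date(c("1930-01-01", "1930-01-02", "1931-05-01", "1931-05-02")),
  quantity   = c("rad", "rad", "rad", "sal"),
  reading    = c(1.5, 3.0, 2.0, 50.0)
)

# Keep radiation rows, then take the row with the largest reading per site
top_rad <- toy_joined %>%
  filter(quantity == "rad") %>%
  group_by(site_id) %>%
  slice_max(reading, n = 1) %>%
  ungroup() %>%
  select(site_id, visit_date, reading)
```

Filtering on quantity first matters: without it, site B would report the salinity reading of 50.0 instead of its highest radiation reading.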

Plotting

more-example-exams/#plotting

Question 1

The code below is supposed to read the file home-range-database.csv to create a tibble called hra_raw, but contains a bug. Describe and fix the problem. (There are several ways to fix it: please use whichever you prefer.)

hra_raw <- read_csv(here::here("data", "home-range-database.csv"))

From the documentation, here::here() builds a file path relative to the root of the current project. My project has no data folder containing home-range-database.csv, so read_csv() fails because no file exists at that path. One fix is to save home-range-database.csv in a data folder at the project root. Below I instead read the file from the URL provided for the csv.

hra_raw <- read_csv("https://education.rstudio.com/blog/2020/08/more-example-exams/home-range-database.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   mean.mass.g = col_double(),
##   log10.mass = col_double(),
##   mean.hra.m2 = col_double(),
##   log10.hra = col_double(),
##   preymass = col_double(),
##   log10.preymass = col_double(),
##   PPMR = col_double()
## )

## See spec(...) for full column specifications.
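
The two fixes could be combined so that a local copy is used when present. A minimal sketch, assuming the here package is installed; hra_source() is a hypothetical helper and not part of the exam:

```r
# Hypothetical helper: prefer a local copy under data/, else use the exam URL
hra_source <- function() {
  local_path <- here::here("data", "home-range-database.csv")
  if (file.exists(local_path)) {
    local_path
  } else {
    "https://education.rstudio.com/blog/2020/08/more-example-exams/home-range-database.csv"
  }
}

# Usage (reads over the network when no local copy exists):
# hra_raw <- readr::read_csv(hra_source())
```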

Question 2

Convert the class column (which is text) to create a factor column class_fct and assign the result to a tibble hra. Use forcats to order the factor levels as:

mammalia
reptilia
aves
actinopterygii

hra <- hra_raw %>%
  mutate(
    class_fct = fct_relevel(
      class, "mammalia", "reptilia", "aves", "actinopterygii"
    )
  )
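
The resulting level order can be checked with levels(). A small sketch on toy data, assuming forcats' fct_relevel() does the reordering (it accepts a character vector and converts it to a factor first):

```r
library(tidyverse)

# Toy stand-in for the class column; the real data has many more rows
toy <- tibble(class = c("aves", "mammalia", "actinopterygii", "reptilia"))

toy_fct <- toy %>%
  mutate(
    class_fct = fct_relevel(
      class, "mammalia", "reptilia", "aves", "actinopterygii"
    )
  )

# Levels follow the requested order, not alphabetical order
level_order <- levels(toy_fct$class_fct)
```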

Question 3

Create a scatterplot showing the relationship between log10.mass and log10.hra in hra.

ggplot(hra, aes(x = log10.mass, y = log10.hra)) +
  geom_point()

Question 4

Colorize the points in the scatterplot by class_fct.

ggplot(hra, aes(x = log10.mass, y = log10.hra)) +
  geom_point(aes(color = class_fct))

Question 5

Display a scatterplot showing only data for birds (class aves) and fit a linear regression to that data using the lm function.

hra %>% 
  filter(class == "aves") %>%
  ggplot(aes(x = log10.mass, y = log10.hra)) +
  geom_point(aes(color = class_fct)) +
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

Functional programming

more-sample-exams/#functional-programming

Question 1

Write a function called summarize_table that takes a title string and a tibble as input and returns a string that says something like, “title has # rows and # columns”. For example, summarize_table('our table', person) should return the string "our table has 5 rows and 3 columns".

summarize_table <- function(title, tibble) {
  num_rows <- nrow(tibble)
  num_cols <- ncol(tibble)
  # Return the string (the last expression) instead of printing it
  str_c(title, "has", num_rows, "rows and", num_cols, "columns", sep = " ")
}

summarize_table("HRA dataset", hra)

## [1] "HRA dataset has 566 rows and 25 columns"

Question 2

Write another function called show_columns that takes a string and a tibble as input and returns a string that says something like, “table has columns name, name, name”. For example, show_columns('person', person) should return the string "person has columns person_id, personal_name, family_name".

show_columns <- function(title, tibble) {
  col_names_collapsed <- str_c(names(tibble), collapse = ", ")
  # Return the string (the last expression) instead of printing it
  str_c(title, "has columns", col_names_collapsed, sep = " ")
}

show_columns("HRA", hra)

## [1] "HRA has columns taxon, common.name, class, order, family, genus, species, primarymethod, N, mean.mass.g, log10.mass, alternative.mass.reference, mean.hra.m2, log10.hra, hra.reference, realm, thermoregulation, locomotion, trophic.guild, dimension, preymass, log10.preymass, PPMR, prey.size.reference, class_fct"

Question 3

The function rows_from_file returns the first N rows from a table in a CSV file given the file’s name and the number of rows desired. Modify it so that if no value is specified for the number of rows, a default of 3 is used.

# Background on optional arguments:
# https://www.r-bloggers.com/2015/08/function-argument-lists-and-missing/
# Giving num_rows a default value in the signature makes it optional

rows_from_file <- function(filename, num_rows = 3) {
  readr::read_csv(filename) %>%
    head(n = num_rows)
}

# should show 3 rows
rows_from_file("https://education.rstudio.com/blog/2020/08/more-example-exams/measurements.csv")

## Parsed with column specification:
## cols(
##   visit_id = col_double(),
##   visitor = col_character(),
##   quantity = col_character(),
##   reading = col_double()
## )

visit_id <dbl>	visitor <chr>	quantity <chr>	reading <dbl>
619	dyer	rad	9.82
619	dyer	sal	0.13
622	dyer	rad	7.80
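
For large files, the row limit can be passed to read_csv() itself through its n_max argument, so parsing stops early instead of reading everything and then trimming. A sketch under that assumption; rows_from_file_nmax is a hypothetical name, and the CSV is a made-up temporary file so the example runs offline:

```r
library(tidyverse)

# Hypothetical variant: n_max stops parsing after num_rows data rows
rows_from_file_nmax <- function(filename, num_rows = 3) {
  read_csv(filename, n_max = num_rows)
}

# Toy CSV written to a temporary file
tmp <- tempfile(fileext = ".csv")
write_csv(tibble(id = 1:10, value = (1:10) / 2), tmp)

first_default <- rows_from_file_nmax(tmp)             # default of 3 rows
first_five <- rows_from_file_nmax(tmp, num_rows = 5)  # explicit row count
```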

Question 4

The function long_name checks whether a string is longer than 4 characters. Use this function and a function from purrr to create a logical vector that contains the value TRUE where family names in the tibble person are longer than 4 characters, and FALSE where they are 4 characters or less.

long_name <- function(name) {
  stringr::str_length(name) > 4
}

person$family_name %>% map_lgl(long_name)

## [1] FALSE  TRUE FALSE  TRUE  TRUE
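
Because str_length() is vectorized, long_name() can also be applied to the whole column at once. The sketch below compares both routes on the five family names shown earlier; map_lgl() remains the answer the question asks for.

```r
library(tidyverse)

long_name <- function(name) {
  stringr::str_length(name) > 4
}

# Family names as they appear in the person table
family_names <- c("Dyer", "Pabodie", "Lake", "Roerich", "Danforth")

via_map <- map_lgl(family_names, long_name)  # one call per element
via_vector <- long_name(family_names)        # a single vectorized call

same_result <- identical(via_map, via_vector)
```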

Wrapping up

more-sample-exams/#wrapping-up

Modify the YAML header of this file so that a table of contents is automatically created each time this document is knit, and fix any errors that are preventing the document from knitting cleanly.

---
title: "Tidyverse Exam Version 2.0"
output:
html_document:
    theme: flatly
---

---
title: "Tidyverse Exam Version 2.0"
output:
  html_document: # this was indented
    theme: flatly
    toc: true    # this was added
---

Tidyverse Exam v2.0

Solutions for August 2020 Sample Exam

Silvia Canelón
