library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Basic Operations
Read the file person.csv and store the result in a tibble called person.
person <- read_csv("https://education.rstudio.com/blog/2020/08/more-example-exams/person.csv")
## Parsed with column specification:
## cols(
## person_id = col_character(),
## personal_name = col_character(),
## family_name = col_character()
## )
class(person)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Create a tibble containing only family and personal names, in that order. You do not need to assign this tibble or any others to variables unless explicitly asked to do so. However, as noted in the introduction, you must use the pipe operator %>% and code that follows the tidyverse style guide.
# View(person)
person %>%
  select(family_name, personal_name)
Create a new tibble containing only the rows in which family names come before the letter M. Your solution should work for tables with more rows than the example, i.e., you cannot rely on row numbers or select specific names.
person %>%
  arrange(family_name) %>%
  filter(family_name < "M")
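The filter above relies on how R orders strings: they are compared character by character, and the exact ordering can depend on the locale. A minimal sketch on a made-up tibble (toy_person and its names are illustrative, not the contents of person.csv):

```r
library(tidyverse)

# Hypothetical data for illustration only.
toy_person <- tibble(
  family_name = c("Adams", "Miller", "Lopez", "Young")
)

# Names whose first letter sorts before "M" are kept. "Miller" is dropped
# because "M" is a prefix of it, so "Miller" sorts after "M".
toy_before_m <- toy_person %>%
  filter(family_name < "M")

toy_before_m
```

All names here are capitalized on purpose; how lowercase and uppercase letters compare is locale-dependent, so mixed-case data may need str_to_upper() first.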
Display all the rows in person sorted by family name length with the longest name first.
person %>%
  arrange(desc(str_length(family_name)))
Cleaning and Counting
Read the file measurements.csv to create a tibble called measurements. (The strings “rad”, “sal”, and “temp” in the quantity column stand for “radiation”, “salinity”, and “temperature” respectively.)
measurements <- read_csv("https://education.rstudio.com/blog/2020/08/more-example-exams/measurements.csv")
## Parsed with column specification:
## cols(
## visit_id = col_double(),
## visitor = col_character(),
## quantity = col_character(),
## reading = col_double()
## )
Create a tibble containing only rows where none of the values are NA and save in a tibble called cleaned.
cleaned <-
  measurements %>%
  filter(!is.na(visitor), !is.na(quantity), !is.na(reading))
# other option: use na.omit(measurements)
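Besides na.omit(), tidyr's drop_na() is another way to express the same cleaning step without naming each column. A small sketch on made-up data (toy_measurements is hypothetical and only mimics the column names above):

```r
library(tidyverse)

# Hypothetical measurements, for illustration only.
toy_measurements <- tibble(
  visit_id = c(1, 2, 3, 4),
  visitor  = c("a", NA, "c", "d"),
  quantity = c("rad", "sal", NA, "temp"),
  reading  = c(9.8, 0.1, 7.2, NA)
)

# With no column arguments, drop_na() removes rows with an NA in any column.
toy_cleaned <- toy_measurements %>%
  drop_na()

toy_cleaned
```

Only the first row survives here, because each of the other rows has an NA in one column.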
Count the number of measurements of each type of quantity in cleaned. Your result should have one row for each quantity "rad", "sal", and "temp".
cleaned %>%
  group_by(quantity) %>%
  summarize(n())
## `summarise()` ungrouping output (override with `.groups` argument)
# other option: use count()
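The count() option mentioned in the comment could look like the sketch below. The tibble toy_cleaned is made up for illustration and only mirrors the quantity and reading columns:

```r
library(tidyverse)

# Hypothetical cleaned data, for illustration only.
toy_cleaned <- tibble(
  quantity = c("rad", "sal", "rad", "temp", "sal", "rad"),
  reading  = c(9.8, 0.1, 7.2, 18.5, 0.2, 4.3)
)

# count() groups by quantity, counts the rows in each group, and returns
# an ungrouped tibble with the counts in a column named n.
toy_counts <- toy_cleaned %>%
  count(quantity)

toy_counts
```

Because count() returns an ungrouped result, it also avoids the ungrouping message that summarize() prints.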
Display the minimum and maximum value of reading separately for each quantity in cleaned. Your result should have one row for each quantity "rad", "sal", and "temp".
cleaned %>%
  group_by(quantity) %>%
  summarize(min(reading), max(reading))
## `summarise()` ungrouping output (override with `.groups` argument)
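Naming the summary columns and setting .groups explicitly makes the result easier to reuse and silences the message shown above. A sketch on made-up data (toy_cleaned, min_reading, and max_reading are illustrative names):

```r
library(tidyverse)

# Hypothetical cleaned data, for illustration only.
toy_cleaned <- tibble(
  quantity = c("rad", "sal", "rad", "temp", "sal", "rad"),
  reading  = c(9.8, 0.1, 7.2, 18.5, 0.2, 4.3)
)

# One row per quantity, with named minimum and maximum columns.
toy_ranges <- toy_cleaned %>%
  group_by(quantity) %>%
  summarize(
    min_reading = min(reading),
    max_reading = max(reading),
    .groups = "drop"
  )

toy_ranges
```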
Create a tibble in which all salinity ("sal") readings greater than 1 are divided by 100. (This is needed because some people wrote percentages as numbers from 0.0 to 1.0, but others wrote them as 0.0 to 100.0.)
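# Salinity rows only: rescale readings that were recorded as percentages (> 1).
measurements %>%
  filter(quantity == "sal") %>%
  mutate(new_reading = ifelse(reading > 1, reading / 100, reading))
# Alternative: keep every row and overwrite reading in place.
# Dividing all salinity readings by 100 would also shrink values already in 0-1.
measurements %>%
  mutate(reading = ifelse(quantity %in% "sal" & reading > 1, reading / 100, reading))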
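To check the rescaling rule by hand, it helps to run it on a few made-up values. The sketch below assumes a hypothetical tibble toy_sal; only salinity readings above 1 should change:

```r
library(tidyverse)

# Hypothetical readings, for illustration only.
toy_sal <- tibble(
  quantity = c("sal", "sal", "rad", "sal"),
  reading  = c(0.13, 41.6, 9.82, 22.5)
)

# Divide by 100 only where the row is salinity AND the value exceeds 1;
# every other reading is passed through unchanged.
toy_fixed <- toy_sal %>%
  mutate(
    reading = if_else(quantity == "sal" & reading > 1, reading / 100, reading)
  )

toy_fixed
```

The radiation reading 9.82 is above 1 but is left alone, and the salinity reading 0.13 is already a fraction, so it is not divided again.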
Combining Data
Read visited.csv and drop rows containing any NAs, assigning the result to a new tibble called visited.
visited <-
  read_csv("https://education.rstudio.com/blog/2020/08/more-example-exams/visited.csv") %>%
  filter(!is.na(site_id), !is.na(visit_date))
## Parsed with column specification:
## cols(
## visit_id = col_double(),
## site_id = col_character(),
## visit_date = col_date(format = "")
## )
Use an inner join to combine visited with cleaned using the visit_id column for matches.
inner_join(visited, cleaned, by = "visit_id")
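An inner join keeps only the rows whose key appears in both tables. A small sketch with made-up tables (toy_visited and toy_readings are hypothetical) shows which rows drop out:

```r
library(tidyverse)

# Hypothetical tables, for illustration only.
toy_visited <- tibble(
  visit_id = c(1, 2, 3),
  site_id  = c("A", "B", "C")
)
toy_readings <- tibble(
  visit_id = c(1, 1, 3, 5),
  quantity = c("rad", "sal", "rad", "temp"),
  reading  = c(9.8, 0.1, 7.2, 18.5)
)

# Visit 2 has no readings and visit 5 has no visit record, so neither
# appears in the result; visit 1 appears twice, once per reading.
toy_joined <- inner_join(toy_visited, toy_readings, by = "visit_id")

toy_joined
```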
Find the highest radiation ("rad") reading at each site. (Sites are identified by values in the site_id column.)
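inner_join(visited, cleaned, by = "visit_id") %>%
  # Keep radiation rows only; otherwise the maximum mixes all quantities.
  filter(quantity == "rad") %>%
  group_by(site_id) %>%
  summarize(max(reading))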
## `summarise()` ungrouping output (override with `.groups` argument)
Find the date of the highest radiation reading at each site.
inner_join(visited, cleaned, by = "visit_id") %>%
  filter(quantity == "rad") %>%
  group_by(site_id) %>%
  filter(reading == max(reading)) %>%
  select(site_id, visit_date, reading)
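Because the joined table mixes quantities, restricting to "rad" before taking the maximum matters. The sketch below uses a made-up joined table (toy_joined is hypothetical) and dplyr's slice_max() as one possible way to pick the top row per site:

```r
library(tidyverse)

# Hypothetical joined data, for illustration only.
toy_joined <- tibble(
  site_id    = c("A", "A", "B", "B", "B"),
  visit_date = as.Date(c("2020-01-01", "2020-01-05", "2020-02-01",
                         "2020-02-03", "2020-02-09")),
  quantity   = c("rad", "rad", "rad", "sal", "rad"),
  reading    = c(9.8, 7.2, 1.5, 41.6, 4.3)
)

# Filter to radiation first, then keep the row with the largest reading
# within each site. Ties, if any, are all kept by default.
toy_peaks <- toy_joined %>%
  filter(quantity == "rad") %>%
  group_by(site_id) %>%
  slice_max(reading, n = 1) %>%
  ungroup() %>%
  select(site_id, visit_date, reading)

toy_peaks
```

Without the filter, site B would report the salinity value 41.6 and its date instead of the radiation peak.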
The code below is supposed to read the file home-range-database.csv to create a tibble called hra_raw, but contains a bug. Describe and fix the problem. (There are several ways to fix it: please use whichever you prefer.)
hra_raw <- read_csv(here::here("data", "home-range-database.csv"))
According to the documentation, here::here() builds a file path relative to the project's root directory. There is no "data" folder or "home-range-database.csv" file in my project directory, so read_csv() cannot find a file at the path that here() constructs. I could fix this by moving home-range-database.csv into a data folder in my project directory. Below, I read the file from the URL provided for the csv instead.
hra_raw <- read_csv("https://education.rstudio.com/blog/2020/08/more-example-exams/home-range-database.csv")
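To see what the failing call was looking for, the path can be inspected before reading. This is only a sketch, assuming the here package is installed; the local data folder is the one named in the exam code and may not exist on a given machine:

```r
# here::here() joins its arguments into one path rooted at the project
# directory that the here package detects.
path <- here::here("data", "home-range-database.csv")
path

# Checking for the file first gives a clearer message than a read error.
if (file.exists(path)) {
  hra_raw <- readr::read_csv(path)
} else {
  message("No file at ", path)
}
```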
## Parsed with column specification:
## cols(
## .default = col_character(),
## mean.mass.g = col_double(),
## log10.mass = col_double(),
## mean.hra.m2 = col_double(),
## log10.hra = col_double(),
## preymass = col_double(),
## log10.preymass = col_double(),
## PPMR = col_double()
## )
## See spec(...) for full column specifications.
Convert the class column (which is text) to create a factor column class_fct and assign the result to a tibble hra. Use forcats to order the factor levels as:
hra <-
  hra_raw %>%
  # fct_relevel() is the forcats way to set the level order the task asks for.
  mutate(class_fct = fct_relevel(class, "mammalia", "reptilia", "aves", "actinopterygii"))
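Since the task asks for forcats, fct_relevel() is one function that sets an explicit level order. A sketch on a made-up character vector (toy_class is hypothetical) shows the resulting levels:

```r
library(tidyverse)

# Hypothetical class labels, for illustration only.
toy_class <- c("aves", "mammalia", "reptilia", "aves", "actinopterygii")

# fct_relevel() converts the character vector to a factor and moves the
# named levels to the front, in the order given.
toy_fct <- fct_relevel(toy_class, "mammalia", "reptilia", "aves", "actinopterygii")

levels(toy_fct)
```

Unlike factor(x, levels = ...), values that are not named in the call are kept as trailing levels rather than turned into NA.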
Create a scatterplot showing the relationship between log10.mass and log10.hra in hra.
ggplot(hra, aes(x = log10.mass, y = log10.hra)) +
  geom_point()
Colorize the points in the scatterplot by class_fct.
ggplot(hra, aes(x = log10.mass, y = log10.hra)) +
  geom_point(aes(color = class_fct))
Display a scatterplot showing only data for birds (class aves) and fit a linear regression to that data using the lm function.
hra %>%
  filter(class == "aves") %>%
  ggplot(aes(x = log10.mass, y = log10.hra)) +
  geom_point(aes(color = class_fct)) +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
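The line drawn by geom_smooth(method = "lm") comes from a model with the formula y ~ x, as the message above notes. To see the fitted intercept and slope as numbers, the same model can be fitted directly with lm(). The sketch below uses made-up values (toy_birds is hypothetical) rather than the home range data:

```r
library(tidyverse)

# Hypothetical values chosen to lie exactly on y = 1 + 2x.
toy_birds <- tibble(
  log10.mass = c(1, 2, 3, 4),
  log10.hra  = c(3, 5, 7, 9)
)

# Fit y ~ x with log10.hra as y and log10.mass as x.
toy_fit <- lm(log10.hra ~ log10.mass, data = toy_birds)

# The coefficients are the intercept and the slope of the fitted line.
coef(toy_fit)
```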
Functional Programming
Write a function called summarize_table that takes a title string and a tibble as input and returns a string that says something like, “title has # rows and # columns”. For example, summarize_table('our table', person) should return the string "our table has 5 rows and 3 columns".
summarize_table <- function(title, tibble) {
  num_rows <- nrow(tibble)
  num_cols <- ncol(tibble)
  # Return the string instead of printing it: the last expression is the return value.
  str_c(title, "has", num_rows, "rows and", num_cols, "columns", sep = " ")
}
summarize_table("HRA dataset", hra)
## [1] "HRA dataset has 566 rows and 25 columns"
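The task asks for a function that returns a string. Because R returns the value of the last expression a function evaluates, such a function can end with the string itself rather than a print() call. A minimal sketch, using a hypothetical helper named describe_size and a made-up tibble:

```r
library(tidyverse)

# Hypothetical helper, for illustration only.
describe_size <- function(title, tbl) {
  # The str_c() result is the last expression, so it is the return value.
  str_c(title, " has ", nrow(tbl), " rows and ", ncol(tbl), " columns")
}

toy_msg <- describe_size("toy table", tibble(a = 1:2, b = 3:4, c = 5:6))
toy_msg
```

Returning the value keeps the function free of side effects, so the caller decides whether to print the string, store it, or pass it on.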
Write another function called show_columns that takes a string and a tibble as input and returns a string that says something like, “table has columns name, name, name”. For example, show_columns('person', person) should return the string "person has columns person_id, personal_name, family_name".
show_columns <- function(title, tibble) {
  col_names <- names(tibble)
  col_names_collapsed <- str_c(col_names, collapse = ", ")
  # Return the string instead of printing it: the last expression is the return value.
  str_c(title, "has columns", col_names_collapsed, sep = " ")
}
show_columns("HRA", hra)
## [1] "HRA has columns taxon, common.name, class, order, family, genus, species, primarymethod, N, mean.mass.g, log10.mass, alternative.mass.reference, mean.hra.m2, log10.hra, hra.reference, realm, thermoregulation, locomotion, trophic.guild, dimension, preymass, log10.preymass, PPMR, prey.size.reference, class_fct"
The function rows_from_file returns the first N rows from a table in a CSV file given the file’s name and the number of rows desired. Modify it so that if no value is specified for the number of rows, a default of 3 is used.
# https://www.r-bloggers.com/2015/08/function-argument-lists-and-missing/
# Giving num_rows a default value makes the argument optional.
rows_from_file <- function(filename, num_rows = 3) {
  rows <- readr::read_csv(filename)
  head(rows, n = num_rows)
}
# should show 3 rows
rows_from_file("https://education.rstudio.com/blog/2020/08/more-example-exams/measurements.csv")
## Parsed with column specification:
## cols(
## visit_id = col_double(),
## visitor = col_character(),
## quantity = col_character(),
## reading = col_double()
## )
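A default can also be written directly in the function signature, which avoids the need for a NULL check. The sketch below uses a hypothetical helper, first_rows, that works on an in-memory tibble so it runs without downloading a file:

```r
library(tidyverse)

# Hypothetical helper, for illustration only: num_rows defaults to 3.
first_rows <- function(tbl, num_rows = 3) {
  head(tbl, n = num_rows)
}

toy <- tibble(id = 1:5)

# Without a second argument the default applies; with one, it is overridden.
rows_default <- first_rows(toy)
rows_two <- first_rows(toy, 2)

nrow(rows_default)
nrow(rows_two)
```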
The function long_name checks whether a string is longer than 4 characters. Use this function and a function from purrr to create a logical vector that contains the value TRUE where family names in the tibble person are longer than 4 characters, and FALSE where they are 4 characters or less.
long_name <- function(name) {
  stringr::str_length(name) > 4
}
person$family_name %>% map_lgl(long_name)
## [1] FALSE TRUE FALSE TRUE TRUE
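map_lgl() calls long_name once per element and checks that each call returns a single logical value. Since str_length() is itself vectorized, calling long_name on the whole vector gives the same answer. A sketch on made-up names (toy_names is hypothetical):

```r
library(tidyverse)

long_name <- function(name) {
  stringr::str_length(name) > 4
}

# Hypothetical names, for illustration only.
toy_names <- c("Lake", "Roerich", "Dyer")

# Element-by-element with purrr, as the task requires.
via_map <- map_lgl(toy_names, long_name)

# One vectorized call on the whole vector.
direct <- long_name(toy_names)

identical(via_map, direct)
```

The purrr version is what the exam asks for; it becomes necessary when the function being applied only handles one value at a time.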
Wrapping Up
Modify the YAML header of this file so that a table of contents is automatically created each time this document is knit, and fix any errors that are preventing the document from knitting cleanly.
---
title: "Tidyverse Exam Version 2.0"
output:
html_document:
theme: flatly
---
---
title: "Tidyverse Exam Version 2.0"
output:
  html_document: # this was indented
    theme: flatly
    toc: true # this was added
---