Working with Spreadsheets

Learning objectives

Understand the procedure for employing read.csv() to import a CSV file into R.
Familiarize with navigating and examining a data frame.
Acquire the ability to filter and extract subsets of a data frame based on specific conditions.

Now we have learned the basic operators and data structures in R, we will focus on using these learned concept in analyzing the actual data. We will be using the data collected for tomato grafting project in 2018. Although this data is simulated for the propose of this workshop, following is the basic experimental design and brief background.

Tomato grafting

Background

Tomato grafting is an ancient propagation practice in agriculture that is commonly used in vegetable production. In the case of tomato, interspecific rootstocks, where rootstock and scion belong to different species of Solanum, generally have a rootstock resistant to soilborne diseases (e.g., Fusarium wilt, Verticillium wilt, bacterial wilt, and root-knot nematodes) grafted with a scion that produces higher-quality fruit. In addition to the effectiveness of grafted plants in managing soilborne diseases, grafted plants are often more vigorous and more efficient in nutrient uptake and resource utilization, as well as resistant to abiotic stresses. Thus, plants grafted with effective and vigorous rootstocks often provide higher fruit yield and plant biomass.

Given the benefits of grafting, researcher ran experiments where three rootstock genotypes (BHN589, RST-04-106, and Maxifort) representing four different treatments, as follows:

(i). nongrafted BHN589 plants,

(ii). self-grafted BHN589 plants (plants grafted to their own rootstock),

(iii). BHN589 grafted on RST-04-106, and

(iv). BHN589 grafted on Maxifort.

The choice of BHN589 as scion was primarily based on the popularity of BHN589 for its high fruit yield and quality and long shelf life. We selected Maxifort because it is a common and popular rootstock and RST-04-106 as a new rootstock variety based on breeders’ recommendations.

Experimental design

The four graft treatments were assigned to four plots per block in a randomized complete block design. Experiments were replicated in 5 different tunnels.

You can read more about this research at Poudel et., al, 2019 or Garrett Lab

Downloading the data

Download the data, and save these csv files inside input_data folder within RiceInNepal project folder that we created earlier, on Desktop.

Loading the Data

Our input data is ready to be load and analyze with R. We can read it into R and assign it to an object by using the read.csv() function. The first argument to read.csv() is the path of the file you want to read, in quotes. This path will be relative to your current working directory, which in our case is the R Project folder. So from there, we want to access the input_data folder, and then the name of the CSV file.

yield_df <- read.csv("input_data/tomato_grafting_yield_spad.csv", header=TRUE)

# view head of the data object
head(yield_df)

##      plantID marketable_yield_kg spad_value sampling_date
## 1 T1-B1-M-P1                49.9       77.3       6/15/18
## 2 T1-B2-M-P1                47.7       79.2       6/15/18
## 3 T1-B3-M-P1                52.5       72.4       6/15/18
## 4 T1-B4-M-P1                52.5       78.5       6/15/18
## 5 T2-B1-M-P1                53.1       81.4       6/15/18
## 6 T2-B2-M-P1                53.8       79.5       6/15/18

# check the dimension of dataframe
dim(yield_df)

## [1] 160   4

# check the data types in each column of dataframe
str(yield_df)

## 'data.frame':    160 obs. of  4 variables:
##  $ plantID            : chr  "T1-B1-M-P1" "T1-B2-M-P1" "T1-B3-M-P1" "T1-B4-M-P1" ...
##  $ marketable_yield_kg: num  49.9 47.7 52.5 52.5 53.1 53.8 48.5 48.6 50.8 51.9 ...
##  $ spad_value         : num  77.3 79.2 72.4 78.5 81.4 79.5 82.1 81.5 81.9 82 ...
##  $ sampling_date      : chr  "6/15/18" "6/15/18" "6/15/18" "6/15/18" ...

Sure, here are some examples of subsetting operations for a dataframe yield_df based on different conditions:

Subsetting based on a single condition:

# Subset rows where SPAD value is greater than 50
s_g80_df <- yield_df[yield_df$spad_value > 80, ]
s_g80_df

##       plantID marketable_yield_kg spad_value sampling_date
## 5  T2-B1-M-P1                53.1       81.4       6/15/18
## 7  T2-B3-M-P1                48.5       82.1       6/15/18
## 8  T2-B4-M-P1                48.6       81.5       6/15/18
## 9  T3-B1-M-P1                50.8       81.9       6/15/18
## 10 T3-B2-M-P1                51.9       82.0       6/15/18
## 13 T4-B1-M-P1                48.1       84.9       6/15/18
## 15 T4-B3-M-P1                42.5       81.4       6/15/18
## 16 T4-B4-M-P1                49.5       80.4       6/15/18
## 17 T5-B1-M-P1                50.1       81.7       6/15/18
## 20 T5-B4-M-P1                44.1       80.2       6/15/18
## 21 T1-B1-M-P2                52.9       80.6  july-15-2018
## 23 T1-B3-M-P2                43.4       80.1  july-15-2018
## 24 T1-B4-M-P2                47.7       81.2  july-15-2018
## 26 T2-B2-M-P2                48.0       83.4  july-15-2018
## 27 T2-B3-M-P2                48.3       86.6  july-15-2018
## 28 T2-B4-M-P2                51.0       84.4  july-15-2018
## 31 T3-B3-M-P2                51.5       80.2  july-15-2018
## 32 T3-B4-M-P2                54.9       82.6  july-15-2018
## 33 T4-B1-M-P2                51.4       82.3  july-15-2018
## 36 T4-B4-M-P2                50.0       80.4  july-15-2018
## 37 T5-B1-M-P2                43.2       81.9  july-15-2018
## 38 T5-B2-M-P2                46.3       82.6  july-15-2018
## 39 T5-B3-M-P2                49.0       82.4  july-15-2018

Subsetting based on multiple conditions:

# Subset rows where SPAD value is greater than 50 and yield is less than 100
s_g80_y_g50_df <- yield_df[yield_df$spad_value > 80 & yield_df$marketable_yield_kg > 50, ]
s_g80_y_g50_df

##       plantID marketable_yield_kg spad_value sampling_date
## 5  T2-B1-M-P1                53.1       81.4       6/15/18
## 9  T3-B1-M-P1                50.8       81.9       6/15/18
## 10 T3-B2-M-P1                51.9       82.0       6/15/18
## 17 T5-B1-M-P1                50.1       81.7       6/15/18
## 21 T1-B1-M-P2                52.9       80.6  july-15-2018
## 28 T2-B4-M-P2                51.0       84.4  july-15-2018
## 31 T3-B3-M-P2                51.5       80.2  july-15-2018
## 32 T3-B4-M-P2                54.9       82.6  july-15-2018
## 33 T4-B1-M-P2                51.4       82.3  july-15-2018

Subsetting based on column names:

# Subset dataframe to select only SPAD and yield columns
yield_df_sel_sy <- yield_df[, c("spad_value", "marketable_yield_kg")]
head(yield_df_sel_sy)

##   spad_value marketable_yield_kg
## 1       77.3                49.9
## 2       79.2                47.7
## 3       72.4                52.5
## 4       78.5                52.5
## 5       81.4                53.1
## 6       79.5                53.8

Subsetting based on specific values in a column:

# Subset rows where the cultivar is 'A'
yield_df_sel_mp1 <- yield_df[yield_df$plantID == 'T1-B1-M-P1', ]
yield_df_sel_mp1

##      plantID marketable_yield_kg spad_value sampling_date
## 1 T1-B1-M-P1                49.9       77.3       6/15/18

Subsetting based on row numbers:

# Subset first 10 rows of the dataframe
yield_df_sub_r_1_5 <- yield_df[1:5, ]
yield_df_sub_r_1_5

##      plantID marketable_yield_kg spad_value sampling_date
## 1 T1-B1-M-P1                49.9       77.3       6/15/18
## 2 T1-B2-M-P1                47.7       79.2       6/15/18
## 3 T1-B3-M-P1                52.5       72.4       6/15/18
## 4 T1-B4-M-P1                52.5       78.5       6/15/18
## 5 T2-B1-M-P1                53.1       81.4       6/15/18

Subsetting columns from index 2 to 4:

yield_df_sub_c2_4 <- yield_df[, 2:4]
head(yield_df_sub_c2_4)

##   marketable_yield_kg spad_value sampling_date
## 1                49.9       77.3       6/15/18
## 2                47.7       79.2       6/15/18
## 3                52.5       72.4       6/15/18
## 4                52.5       78.5       6/15/18
## 5                53.1       81.4       6/15/18
## 6                53.8       79.5       6/15/18

Subsetting rows and columns simultaneously:

yield_df_sub_r1_5_c2_4 <- yield_df[5:10, 2:4]
yield_df_sub_r1_5_c2_4

##    marketable_yield_kg spad_value sampling_date
## 5                 53.1       81.4       6/15/18
## 6                 53.8       79.5       6/15/18
## 7                 48.5       82.1       6/15/18
## 8                 48.6       81.5       6/15/18
## 9                 50.8       81.9       6/15/18
## 10                51.9       82.0       6/15/18

How to name object variable in R?

# Good variable name
car_sales_data <- read.csv("car_sales.csv")

# Avoid using reserved words
# Avoid: data <- read.csv("car_sales.csv") 

# Use descriptive names
# Good variable name
customer_list <- read.csv("customer_data.csv")

# Use snake_case:
# Snake_case is a naming convention where words are written in lowercase letters and separated by underscores. 
# Good variable name
monthly_sales_data <- read.csv("monthly_sales.csv")

How to NOT name variable in R?

# Bad variable name - too short and unclear
x <- read.csv("car_sales.csv")

# Bad variable name - using cryptic abbreviations
csd <- read.csv("car_sales.csv")

# Bad variable name - overly long and verbose
this_is_the_variable_representing_the_dataset_containing_car_sales_data <- read.csv("car_sales.csv")

# Bad variable name - starts with a number
# Avoid: 2022_sales_data <- read.csv("sales_2022.csv")

# Bad variable name - using special characters other than underscore
# Avoid: sales-data <- read.csv("sales.csv")