Home > database >  Data Wrangling for Modeling in R
Data Wrangling for Modeling in R

Time:12-15

I have a data set (original version, # A tibble: 33,478 x 12) of the form similar to the attached picture, and partial data:

enter image description here

dput(head(canals2, n=10))

 structure(list(Site = c(1, 2, 4, 11, 10, 12, 13, 14, 15, 16), 
    `Sample Date` = c("2/11/2004", "2/11/2004", "2/11/2004", 
    "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", 
    "2/11/2004", "2/11/2004"), `Analysis code` = c("NH3", "NH3", 
    "NH3", "Chl a", "Chl a", "Chl a", "NH3", "Chl a", "NH3", 
    "NH3"), Analysis = c("Ammonia-Nitrogen", "Ammonia-Nitrogen", 
    "Ammonia-Nitrogen", "Chlorophyll a", "Chlorophyll a", "Chlorophyll a", 
    "Ammonia-Nitrogen", "Chlorophyll a", "Ammonia-Nitrogen", 
    "Ammonia-Nitrogen"), Result = c(0.068, 0.07, 0.014, 1.31, 
    1.39, 1.95, 0.247, 1.46, 0.113, 0.17), Units = c("mg/L", 
    "mg/L", "mg/L", "mg/m3", "mg/m3", "mg/m3", "mg/L", "mg/m3", 
    "mg/L", "mg/L")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

I would like to try to predict, for instance, "Chlorophyll a" from "Ammonia-Nitrogen" using a linear model (say, using the lm() function). lm() takes column names as inputs for 'formula', yet this dataset was generated very differently. I should be using the values in the Results column for each analysis, but I can't seem to find a good way of organizing my data as such.

So far, I tried splitting the data by analysis, with the intention of creating a new dataframe for each analysis, then replacing the Result with the name of the analysis selected in that dataframe. Here is the function I used (ran it on the main data set, which is why it contains more analysis names):

analysis_list = unique(canals$Analysis)

> analysis_list

1 "Ammonia-Nitrogen" "Chlorophyll a" "Fecal Coliform"
[4] "Specific Conductance" "Copper" "Dissolved Oxygen"
[7] "E Coli" "Enterococci" "Nitrite Nitrate"
[10] "Ortho-Phosphate" "pH" "Salinity"
[13] "Temperature" "Total Kjeldahl Nitrogen" "Total Nitrogen"
[16] "Total Phosphorus" "Turbidity"

split_an <- function() {
  my_list <- vector(mode = "list", length = 0)
  for (i in 1:17) {
    analysis_var = analysis_list[i]
    my_var <- canals %>% filter(Analysis == analysis_var)
    my_list[[i]] = my_var
    
  }
}

split_an()  

Unfortunately, that didn't work as expected, and I had many issues merging the tables I created. I tried other ways as well, but I am getting nowhere. Is anyone willing to offer any suggestions?

CodePudding user response:

If I understand correctly, then it sounds like you're trying to restructure your data to get it into the proper form for modelling purposes. I think using pivot_wider (from tidyr) will get you what you want. Here's what I did:

First, here's your data as a dataframe:

Site <- c(1, 2, 4, 11, 10, 12, 13, 14, 15, 16)
Sample_Date <- c("2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", 
"2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004")
Analysis_code <- c("NH3", "NH3", "NH3", "Chl a", "Chl a", "Chl a", "NH3", "Chl 
a", "NH3", "NH3")
Analysis <- c("Ammonia-Nitrogen", "Ammonia-Nitrogen", "Ammonia-Nitrogen", 
"Chlorophyll a", "Chlorophyll a", "Chlorophyll a", "Ammonia-Nitrogen", 
"Chlorophyll a", "Ammonia-Nitrogen", "Ammonia-Nitrogen")
Results <- c(0.068, 0.07, 0.014, 1.31, 1.39, 1.95, 0.247, 1.46, 0.113, 0.17)
Units <- c("mg/L", "mg/L", "mg/L", "mg/m3", "mg/m3", "mg/m3", "mg/L", "mg/m3", 
"mg/L", "mg/L")

   Site Sample_Date Analysis_code         Analysis Results Units
1     1   2/11/2004           NH3 Ammonia-Nitrogen   0.068  mg/L
2     2   2/11/2004           NH3 Ammonia-Nitrogen   0.070  mg/L
3     4   2/11/2004           NH3 Ammonia-Nitrogen   0.014  mg/L
4    11   2/11/2004         Chl a    Chlorophyll a   1.310 mg/m3
5    10   2/11/2004         Chl a    Chlorophyll a   1.390 mg/m3

Next, we'll apply pivot_wider to spread the Analysis variable. This will leave you with a column for each of the Analysis types you have, along with their respective Results value.

#spread the analysis variable
new_df <- df %>%
  pivot_wider(names_from = "Analysis", values_from = "Results")

    Site Sample_Date Analysis_code Units `Ammonia-Nitrogen` `Chlorophyll a`
   <dbl> <chr>       <chr>         <chr>              <dbl>           <dbl>
 1     1 2/11/2004   NH3           mg/L               0.068           NA   
 2     2 2/11/2004   NH3           mg/L               0.07            NA   
 3     4 2/11/2004   NH3           mg/L               0.014           NA   
 4    11 2/11/2004   Chl a         mg/m3             NA                1.31
 5    10 2/11/2004   Chl a         mg/m3             NA                1.39
  • Related