I have a data set (original version, # A tibble: 33,478 x 12
) of the form similar to the attached picture, and partial data:
dput(head(canals2, n=10))
structure(list(Site = c(1, 2, 4, 11, 10, 12, 13, 14, 15, 16),
`Sample Date` = c("2/11/2004", "2/11/2004", "2/11/2004",
"2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004",
"2/11/2004", "2/11/2004"), `Analysis code` = c("NH3", "NH3",
"NH3", "Chl a", "Chl a", "Chl a", "NH3", "Chl a", "NH3",
"NH3"), Analysis = c("Ammonia-Nitrogen", "Ammonia-Nitrogen",
"Ammonia-Nitrogen", "Chlorophyll a", "Chlorophyll a", "Chlorophyll a",
"Ammonia-Nitrogen", "Chlorophyll a", "Ammonia-Nitrogen",
"Ammonia-Nitrogen"), Result = c(0.068, 0.07, 0.014, 1.31,
1.39, 1.95, 0.247, 1.46, 0.113, 0.17), Units = c("mg/L",
"mg/L", "mg/L", "mg/m3", "mg/m3", "mg/m3", "mg/L", "mg/m3",
"mg/L", "mg/L")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
I would like to try to predict, for instance, "Chlorophyll a" from "Ammonia-Nitrogen" using a linear model (say, using the lm()
function). lm()
takes column names as inputs for 'formula', yet this dataset was generated very differently. I should be using the values in the Results
column for each analysis, but I can't seem to find a good way of organizing my data as such.
So far, I tried splitting the data by analysis, with the intention of creating a new dataframe for each analysis, then replacing the Result
with the name of the analysis selected in that dataframe. Here is the function I used (ran it on the main data set, which is why it contains more analysis names):
analysis_list = unique(canals$Analysis)
> analysis_list
1 "Ammonia-Nitrogen" "Chlorophyll a" "Fecal Coliform"
[4] "Specific Conductance" "Copper" "Dissolved Oxygen"
[7] "E Coli" "Enterococci" "Nitrite Nitrate"
[10] "Ortho-Phosphate" "pH" "Salinity"
[13] "Temperature" "Total Kjeldahl Nitrogen" "Total Nitrogen"
[16] "Total Phosphorus" "Turbidity"
split_an <- function() {
my_list <- vector(mode = "list", length = 0)
for (i in 1:17) {
analysis_var = analysis_list[i]
my_var <- canals %>% filter(Analysis == analysis_var)
my_list[[i]] = my_var
}
}
split_an()
Unfortunately, that didn't work as expected, and I had many issues merging the tables I created. I tried other ways as well, but I am getting nowhere. Is anyone willing to offer any suggestions?
CodePudding user response:
If I understand correctly, then it sounds like you're trying to restructure your data to get it into the proper form for modelling purposes. I think using pivot_wider
(from tidyr
) will get you what you want. Here's what I did:
First, here's your data as a dataframe:
Site <- c(1, 2, 4, 11, 10, 12, 13, 14, 15, 16)
Sample_Date <- c("2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004",
"2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004", "2/11/2004")
Analysis_code <- c("NH3", "NH3", "NH3", "Chl a", "Chl a", "Chl a", "NH3", "Chl
a", "NH3", "NH3")
Analysis <- c("Ammonia-Nitrogen", "Ammonia-Nitrogen", "Ammonia-Nitrogen",
"Chlorophyll a", "Chlorophyll a", "Chlorophyll a", "Ammonia-Nitrogen",
"Chlorophyll a", "Ammonia-Nitrogen", "Ammonia-Nitrogen")
Results <- c(0.068, 0.07, 0.014, 1.31, 1.39, 1.95, 0.247, 1.46, 0.113, 0.17)
Units <- c("mg/L", "mg/L", "mg/L", "mg/m3", "mg/m3", "mg/m3", "mg/L", "mg/m3",
"mg/L", "mg/L")
Site Sample_Date Analysis_code Analysis Results Units
1 1 2/11/2004 NH3 Ammonia-Nitrogen 0.068 mg/L
2 2 2/11/2004 NH3 Ammonia-Nitrogen 0.070 mg/L
3 4 2/11/2004 NH3 Ammonia-Nitrogen 0.014 mg/L
4 11 2/11/2004 Chl a Chlorophyll a 1.310 mg/m3
5 10 2/11/2004 Chl a Chlorophyll a 1.390 mg/m3
Next, we'll apply pivot_wider
to spread the Analysis
variable. This will leave you with a column for each of the Analysis
types you have, along with their respective Results
value.
#spread the analysis variable
new_df <- df %>%
pivot_wider(names_from = "Analysis", values_from = "Results")
Site Sample_Date Analysis_code Units `Ammonia-Nitrogen` `Chlorophyll a`
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1 2/11/2004 NH3 mg/L 0.068 NA
2 2 2/11/2004 NH3 mg/L 0.07 NA
3 4 2/11/2004 NH3 mg/L 0.014 NA
4 11 2/11/2004 Chl a mg/m3 NA 1.31
5 10 2/11/2004 Chl a mg/m3 NA 1.39