I am working with a messy wide format dataset of CROP data where each crop has six associated variables (crop type, crop acreage, nitrogen in soil, nitrogen applied, organic v. conventional, and notes). There are 50 sets of these 6 variables, 300 columns total, with the unfortunate naming convention: c.0. | a.0. | s.0. | f.0. | o.0. | r.0. |, .... , | c.23. | a.23 | s.23. | f.23. |o.23. | r.23. |.
I want to iterate across sequences of columns (the 6 columns associated with each crop), saving each iteration, or individual crop, as a list (or df) with the common colnames c(CROP, CROP_ACREAGE, SOIL_N, APP_N, TYPE, NOTES). Once all of the crops have been iterated through, I want to build a tidy df by rbinding the 50 individual crops datasets.
Below is an example dataset with only two sets of crop columns, four growers, across three years:
data <- tribble(
~grower, ~ YEAR, ~c.0., ~a.0., ~s.0., ~f.0., ~o.0., ~r.0., ~c.10., ~a.10, ~s.10., ~f.10., ~o.10., ~r.10.,
"Bob", 2014, "Kale, Baby", 7.0, 87.0, 126.0, "C", "", "Carrot", 16.0, 47.8, 137.0, "O", "",
"Janet", 2015, "Broccoli", 18.0, 68.2,162.0, "O", "", "Garlic", 25.0, 9.1, 152.3, "C", "",
"Chris", 2014, "Cabbage", 34.2, 8.6, 200.7, "C", "", "Cauliflower", 105.2, 113.0, 199.4, "O", "",
"Ted", 2016, "Kale", 12.2, 11.9, 120.2, "C", "", "Lettuce, Head", 55.2, 113.0, 166.5, "C", "NY"
)
I've built code that will work... but that will be SUPER inefficient:
build <- function(choice, iter){
a <- data %>% select(choice) %>%
rename(CROP = 3,
ACREAGE = 4,
N_SOIL = 5,
N_APPLIED = 6,
O_C = 7,
NOTES = 8)
saveRDS(a, file = paste0("./intermediate-data/",iter,".RDS"))
}
build(choice = c(1:8), iter = "crop1")
build(choice = c(1:2,9:14), iter = "crop2")
# use build() function until all 50 crops have been built into separate df's
Then I'd pull these all in from the intermediate-data directory and rbind().
But I know there has got to be a more efficient way... where I can iterate across sets of 6-crop-columns without specifying them directly. Any ideas?
CodePudding user response:
We may reshape to 'long' format with pivot_longer
library(dplyr)
data %>%
pivot_longer(cols = contains("."), names_to = c(".value", "grp"),
names_pattern = "^([a-z])\\.(\\d )\\.") %>%
rename(CROP = c, ACREAGE = a, N_SOIL = s, N_APPLIED = f, O_C = o, NOTES = r)
-output
# A tibble: 12 × 9
grower YEAR grp CROP ACREAGE N_SOIL N_APPLIED O_C NOTES
<chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 Bob 2014 0 Kale, Baby 7 87 126 C ""
2 Bob 2014 10 Carrot NA 47.8 137 O ""
3 Bob 2014 <NA> <NA> NA NA NA <NA> <NA>
4 Janet 2015 0 Broccoli 18 68.2 162 O ""
5 Janet 2015 10 Garlic NA 9.1 152. C ""
6 Janet 2015 <NA> <NA> NA NA NA <NA> <NA>
7 Chris 2014 0 Cabbage 34.2 8.6 201. C ""
8 Chris 2014 10 Cauliflower NA 113 199. O ""
9 Chris 2014 <NA> <NA> NA NA NA <NA> <NA>
10 Ted 2016 0 Kale 12.2 11.9 120. C ""
11 Ted 2016 10 Lettuce, Head NA 113 166. C "NY"
12 Ted 2016 <NA> <NA> NA NA NA <NA> <NA>
>