does anyone know how I can loop pipe operator code through multiple dataframe?
I've quite a few dataframe named over the years (df_1990, df_1991 ... df_2020). However, not all years are included, (i.e. df_1993, df_2012 and 3 more years are not available). To account for this, I manually created a list to store all the data frame for the looping (do enlighten me if there's a faster way for this).
df_list = list(df_1990, df_1991, ..., df_2020)
for (i in df_list) {
...
}
The dataframes are pretty simple with just 2 columns (Item (character field) & Cost (numeric field).
Item | Cost |
---|---|
Book_A | 3.00 |
Book_B | 5.00 |
... | ... |
a sample code for the dataframe
df = structure(list(Item = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 2L,
3L, 1L, 2L, 1L, 2L, 1L, 3L, 1L, 2L, 2L, 1L, 3L, 1L), .Label = c("Book A",
"Book B", "Book C"), class = "factor"), Cost = c(5, 3.5, 12,
6, 8, 3, 6, 3.5, 3.8, 13, 5.1, 7, 11.5, 3.8, 5.5, 6.5, 13.5,
5.5, 3.5, 1.2)), class = "data.frame", row.names = c(NA, -20L
))
Does anyone know how I can add in the following code into the ... portion of the for loop code above? Thank you!
df %>%
group_by(Item) %>%
summarise(outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier offset,
lower_value = outlier - offset) %>%
left_join(df, by = 'Item') %>%
transmute(Item, Cost, Outlier = ifelse(Cost < lower_value | Cost > higher_value, 'Y', 'N'))
The code basically detect the outlier (for e.g. if the cost is 60% higher or lower than majority average of the particular item) and output a column of "Y" and "N" for each row respectively. (Credits for the code goes to Ronak Shah)
Ideally the new column created should appear in the list created to allow exporting to excel format
Thank you!
CodePudding user response:
Personally I would move the data wrangling code in a function and would then use lapply
to loop over your list of data frames.
library(dplyr)
df_list <- list(df, df, df)
prep_data <- function(x) {
x %>%
group_by(Item) %>%
summarise(
outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier offset,
lower_value = outlier - offset
) %>%
left_join(x, by = "Item") %>%
transmute(Item, Cost, Outlier = ifelse(Cost < lower_value | Cost > higher_value, "Y", "N"))
}
df_prep <- lapply(df_list, prep_data)
lapply(df_prep, head, 2)
#> [[1]]
#> # A tibble: 2 × 3
#> Item Cost Outlier
#> <fct> <dbl> <chr>
#> 1 Book A 5 N
#> 2 Book A 3.5 N
#>
#> [[2]]
#> # A tibble: 2 × 3
#> Item Cost Outlier
#> <fct> <dbl> <chr>
#> 1 Book A 5 N
#> 2 Book A 3.5 N
#>
#> [[3]]
#> # A tibble: 2 × 3
#> Item Cost Outlier
#> <fct> <dbl> <chr>
#> 1 Book A 5 N
#> 2 Book A 3.5 N
If you want to do it via a for
loop then you could achieve the same result like so:
df_prep <- list()
for (i in seq_along(df_list)) {
df_prep[[i]] <- prep_data(df_list[[i]])
}
CodePudding user response:
Why don't you put all your data into one dataframe:
df_list = list(df_1990 = df_1990, df_1991 = df_1991, ..., df_2020 = df_2020)
df2 = dplyr::bind_rows(df_list, .id = 'Year')
then you only have to add the variable Year
into the group_by
statement:
group_by(Year, Item)
If you need to, you can always convert it back to a list of dataframes:
df2 %>%
tidyr::nest(data = Item:Cost) %>%
pull(data, name = Year)
Btw, you can also improve the code for the outlier detection, by omitting the join:
df2 %>%
group_by(Year, Item) %>%
mutate(outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier offset,
lower_value = outlier - offset) %>%
transmute(Item, Cost, Outlier = if_else(Cost < lower_value | Cost > higher_value, 'Y', 'N'))
using mutate
instead of summarise
copies the result of mean(Cost)
to every row of the group.