Home > Enterprise >  Looping pipe operator code through multiple Dataframe in R
Looping pipe operator code through multiple Dataframe in R

Time:07-04

does anyone know how I can loop pipe operator code through multiple dataframe?

I've quite a few dataframe named over the years (df_1990, df_1991 ... df_2020). However, not all years are included, (i.e. df_1993, df_2012 and 3 more years are not available). To account for this, I manually created a list to store all the data frame for the looping (do enlighten me if there's a faster way for this).

df_list = list(df_1990, df_1991, ..., df_2020)
for (i in df_list) {
 ...
 }

The dataframes are pretty simple with just 2 columns (Item (character field) & Cost (numeric field).

Item Cost
Book_A 3.00
Book_B 5.00
... ...

a sample code for the dataframe

df = structure(list(Item = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 2L, 
3L, 1L, 2L, 1L, 2L, 1L, 3L, 1L, 2L, 2L, 1L, 3L, 1L), .Label = c("Book A", 
"Book B", "Book C"), class = "factor"), Cost = c(5, 3.5, 12, 
6, 8, 3, 6, 3.5, 3.8, 13, 5.1, 7, 11.5, 3.8, 5.5, 6.5, 13.5, 
5.5, 3.5, 1.2)), class = "data.frame", row.names = c(NA, -20L
))

Does anyone know how I can add in the following code into the ... portion of the for loop code above? Thank you!

df %>%
  group_by(Item) %>%
  summarise(outlier = mean(Cost), 
            offset = outlier * 0.6, 
            higher_value = outlier   offset, 
            lower_value = outlier - offset) %>%
  left_join(df, by = 'Item') %>%
  transmute(Item, Cost, Outlier = ifelse(Cost < lower_value | Cost > higher_value, 'Y', 'N'))

The code basically detect the outlier (for e.g. if the cost is 60% higher or lower than majority average of the particular item) and output a column of "Y" and "N" for each row respectively. (Credits for the code goes to Ronak Shah)

Ideally the new column created should appear in the list created to allow exporting to excel format

Thank you!

CodePudding user response:

Personally I would move the data wrangling code in a function and would then use lapply to loop over your list of data frames.

library(dplyr)

df_list <- list(df, df, df)

prep_data <- function(x) {
  x %>%
    group_by(Item) %>%
    summarise(
      outlier = mean(Cost),
      offset = outlier * 0.6,
      higher_value = outlier   offset,
      lower_value = outlier - offset
    ) %>%
    left_join(x, by = "Item") %>%
    transmute(Item, Cost, Outlier = ifelse(Cost < lower_value | Cost > higher_value, "Y", "N"))
}

df_prep <- lapply(df_list, prep_data)

lapply(df_prep, head, 2)
#> [[1]]
#> # A tibble: 2 × 3
#>   Item    Cost Outlier
#>   <fct>  <dbl> <chr>  
#> 1 Book A   5   N      
#> 2 Book A   3.5 N      
#> 
#> [[2]]
#> # A tibble: 2 × 3
#>   Item    Cost Outlier
#>   <fct>  <dbl> <chr>  
#> 1 Book A   5   N      
#> 2 Book A   3.5 N      
#> 
#> [[3]]
#> # A tibble: 2 × 3
#>   Item    Cost Outlier
#>   <fct>  <dbl> <chr>  
#> 1 Book A   5   N      
#> 2 Book A   3.5 N

If you want to do it via a for loop then you could achieve the same result like so:

df_prep <- list()
for (i in seq_along(df_list)) {
  df_prep[[i]] <- prep_data(df_list[[i]])
}

CodePudding user response:

Why don't you put all your data into one dataframe:

df_list = list(df_1990 = df_1990, df_1991 = df_1991, ..., df_2020 = df_2020)
df2 = dplyr::bind_rows(df_list, .id = 'Year')

then you only have to add the variable Year into the group_by statement:

group_by(Year, Item)

If you need to, you can always convert it back to a list of dataframes:

df2 %>% 
  tidyr::nest(data =  Item:Cost) %>% 
  pull(data, name = Year)

Btw, you can also improve the code for the outlier detection, by omitting the join:

df2 %>%
  group_by(Year, Item) %>%
  mutate(outlier = mean(Cost), 
            offset = outlier * 0.6, 
            higher_value = outlier   offset, 
        lower_value = outlier - offset) %>%
  transmute(Item, Cost, Outlier = if_else(Cost < lower_value | Cost > higher_value, 'Y', 'N'))

using mutate instead of summarise copies the result of mean(Cost) to every row of the group.

  • Related