R: apply functions over a list of data frames

Time:06-22

Help with applying functions over a list of data frames.

I don't often work with lists or functions, so after three hours of searching and testing I need some assistance.

I have a list of 2 data frames as follows (the real list has 40):

df1 <- structure(list(ID = 1:4, 
    Period = c("C_2021", "C_2021", "C_2021", "C_2021"), 
    subjects = c(2044L, 2044L, 2058L, 2059L), 
    Q_1_A = c(1L, 1L, 4L, 6L), 
    Q_1_B = c(6L, 1L, 6L, NA), 
    col3 = c(4L, 6L, 5L, 2L), 
    col4 = c(3L, 5L, 4L, 4L)), 
    class = "data.frame", row.names = c(NA, -4L))
        
df2 <- structure(list(ID = 1:4, 
    Period = c("C_2022", "C_2022", "C_2022", "C_2022"), 
    subjects = c(2058L, 2058L, 2065L, 2066L), 
    Q_1_A = c(2L, 5L, 5L, 6L), 
    Q_1_B = c(6L, 1L, 4L, NA), 
    col3 = c(NA, 6L, 5L, 3L), 
    col4 = c(3L, 6L, 5L, 5L)), 
    class = "data.frame", row.names = c(NA, -4L))

The datasets are structured as follows:

    df1
      ID Period subjects Q_1_A Q_1_B col3 col4
    1  1 C_2021     2044     1     6    4    3
    2  2 C_2021     2044     1     1    6    5
    3  3 C_2021     2058     4     6    5    4
    4  4 C_2021     2059     6    NA    2    4
    
    df2
      ID Period subjects Q_1_A Q_1_B col3 col4
    1  1 C_2022     2058     2     6   NA    3
    2  2 C_2022     2058     5     1    6    6
    3  3 C_2022     2065     5     4    5    5
    4  4 C_2022     2066     6    NA    3    5

The list of data frames:

dflist <- list(df1, df2)

I would like to do 2 things:

1. Conditional removal of string before 2nd underscore

I would like to remove characters before the 2nd underscore only in columns beginning with "Q". Column "Q_1_A" would become "A". The code should only impact columns starting with "Q".

Note: The ifelse is important - in the real data there are other columns with 2 underscores that cannot be modified, and the columns in data frames may be in different orders so it needs to be done by column name.

# doesn't work (can't seem to get purrr working either)
    dflist <- lapply(dflist, function(x) {
      names(x) <- ifelse(starts_with(names(x), "Q"), sub("^[^_]*_", "", names(x)), .x)
      x})

2. Once the column names are updated, keep only the columns named in a list.
Note: In the real data there are a lot of columns in each data frame, so it's much easier to list the columns to keep than the ones to remove.

The list of columns to keep is below. It assumes the renaming above has been completed.

col_keep <- c("ID", "Period", "subjects", "A", "B")

# doesn't work
dflist <- lapply(dflist, function(x) {
  x[(names(x) %in% col_keep)]
  x})

**UPDATE** I think the following will actually work:
dflist <- lapply(dflist, function(x) 
{x <- x %>% select(any_of(col_keep))})
# Is this the best way to do it?

Help would be greatly appreciated.

CodePudding user response:

For the first requirement, apply this:

dflist <- lapply(dflist, function(x) {
    names(x) <- ifelse(startsWith(names(x), "Q"), 
    gsub("[Q_0-9]+", "" , names(x)), names(x))
    x})
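
As a side note on why the attempt in the question fails: `starts_with()` is a tidyselect helper meant for use inside selecting functions such as `select()`, `.x` is not defined inside a plain `function(x)`, and `sub("^[^_]*_", ...)` removes text only up to the first underscore. A quick way to check a renaming rule before running it over the whole list is to try it on a plain character vector. The sketch below uses made-up names (including a hypothetical `X_1_Y` column that must stay untouched) and an anchored pattern as one possible choice, not the only one:

```r
# Made-up column names: two Q columns plus a non-Q column with two underscores
nms <- c("ID", "Q_1_A", "Q_1_B", "X_1_Y", "col3")

# For names starting with "Q", strip everything up to and including
# the 2nd underscore; leave every other name as it is
new_nms <- ifelse(startsWith(nms, "Q"),
                  sub("^([^_]*_){2}", "", nms),
                  nms)
print(new_nms)
```

If the printed names look right, the same `ifelse()` expression can be assigned to `names(x)` inside `lapply()`.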

and this for the second:

col_keep <- c("ID", "Period", "subjects", "A", "B")
dflist <- lapply(dflist, function(x) subset(x , select = col_keep))
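
One thing to decide when keeping columns by name is what should happen if a data frame lacks one of them. The following is a minimal base R sketch on a made-up data frame (it has no `subjects` or `B` column), assuming you may want either a strict or a tolerant selection:

```r
# Made-up data frame that lacks the "subjects" and "B" columns
df <- data.frame(ID = 1:2, Period = "C_2021", A = c(1L, 4L))
col_keep <- c("ID", "Period", "subjects", "A", "B")

# Strict: indexing by name fails if any requested column is absent
strict <- tryCatch(df[col_keep], error = function(e) "error: undefined columns")

# Tolerant: keep only the requested columns that actually exist
tolerant <- df[intersect(col_keep, names(df))]
print(names(tolerant))
```

The strict form is useful when a missing column signals a problem in the data, the tolerant form when the data frames legitimately differ. Note also that the second attempt in the question computes the subset but then returns the unmodified `x`, which is why nothing appears to change.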

CodePudding user response:

In base R:

lapply(dflist, \(x)setNames(x, sub('^Q([^_]*_){2}', '', names(x)))[col_keep])
[[1]]
  ID Period subjects A  B
1  1 C_2021     2044 1  6
2  2 C_2021     2044 1  1
3  3 C_2021     2058 4  6
4  4 C_2021     2059 6 NA

[[2]]
  ID Period subjects A  B
1  1 C_2022     2058 2  6
2  2 C_2022     2058 5  1
3  3 C_2022     2065 5  4
4  4 C_2022     2066 6 NA

In tidyverse:

library(tidyverse)
dflist %>%
  map(~rename_with(.,~str_remove(.,'([^_]+_){2}'), starts_with('Q'))%>%
        select(all_of(col_keep)))

[[1]]
  ID Period subjects A  B
1  1 C_2021     2044 1  6
2  2 C_2021     2044 1  1
3  3 C_2021     2058 4  6
4  4 C_2021     2059 6 NA

[[2]]
  ID Period subjects A  B
1  1 C_2022     2058 2  6
2  2 C_2022     2058 5  1
3  3 C_2022     2065 5  4
4  4 C_2022     2066 6 NA

CodePudding user response:

Another solution using base R:

# wrap up code for ease of reading
validate_names <- function(df) {
  setNames(df, ifelse(grepl("^Q", names(df)),
                      gsub("[Q_0-9]", "", names(df)), names(df)))
}

# lapply to transform list, then subset with character vector
lapply(dflist, validate_names) |> 
lapply(`[`, col_keep)
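
Since the question mentions that columns may appear in different orders and that some non-Q columns also contain two underscores, here is a small end-to-end check on made-up data. The helper name `clean_df` and the `X_1_Y` column are hypothetical, and the anchored pattern is just one way to express the rule:

```r
# Hypothetical helper combining both steps: rename Q columns, then keep named columns
clean_df <- function(df, keep) {
  nms <- names(df)
  is_q <- startsWith(nms, "Q")
  # Strip everything up to and including the 2nd underscore, Q columns only
  nms[is_q] <- sub("^([^_]*_){2}", "", nms[is_q])
  names(df) <- nms
  df[keep]  # select by name, so the input column order does not matter
}

# Two small made-up data frames with columns in different orders
a <- data.frame(ID = 1:2, Q_1_A = c(1L, 4L), Q_1_B = c(6L, NA), X_1_Y = c(0L, 0L))
b <- data.frame(Q_1_B = c(6L, 1L), X_1_Y = c(0L, 0L), ID = 1:2, Q_1_A = c(2L, 5L))

keep <- c("ID", "A", "B")
cleaned <- lapply(list(a, b), clean_df, keep = keep)
print(lapply(cleaned, names))
```

Both results should come back with the same columns in the order given by `keep`, regardless of how the inputs were laid out.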