Filter out all data frames which don't have the column Z in a list of data frames?

I have a list of six data frames, five of which have a column "Z". To proceed with my script, I need to remove the data frame that doesn't have column Z, so I tried the following code:

for(i in 1:length(df)){
  if(!("Z" %in% colnames(df[[i]])))
  {
    df[[i]] = NULL
  }
}

This seemed to do the job (it removed the one data frame from the list that didn't have column Z), but I still got the message "Error in df[[i]] : subscript out of bounds". Why is that, and how can I get around the error?

CodePudding user response：

The base Filter function works well here (the \(x) lambda shorthand needs R 4.1 or later; on older versions write function(x) instead):

df <- Filter(\(x) "Z" %in% names(x), df)

As for why your method doesn't work: 1:length(df) is evaluated once, before the loop starts, so i runs over the indices of the original list. As soon as df[[i]] = NULL happens once, df is shorter than it was when the loop started, so the last iteration is out of bounds. You also skip items: if df[[2]] is removed, the original df[[3]] becomes df[[2]] and the original df[[4]] becomes df[[3]], so the next iteration (i = 3) hops over the original df[[3]] without checking it. Lesson: don't change the length of an object while iterating over it.
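If you do want to keep an explicit loop, one way around the shifting indices is to walk the list from the end to the start, so that a removal only ever moves elements that have already been checked. A minimal sketch, assuming a small made-up list called frames (not the data from the question):

```r
# Hypothetical example data: the first two data frames lack column "Z".
frames <- list(
  data.frame(A = 1:2),
  data.frame(B = 3:4),
  data.frame(A = 5:6, Z = 7:8)
)

# Visit the indices from last to first. Removing element i only shifts
# the elements after i, and those have already been visited.
for (i in rev(seq_along(frames))) {
  if (!("Z" %in% colnames(frames[[i]]))) {
    frames[[i]] <- NULL
  }
}

n_left <- length(frames)          # number of data frames that remain
cols_left <- names(frames[[1]])   # columns of the surviving data frame
```

seq_along(frames) is used instead of 1:length(frames) so the loop body is simply skipped when the list is empty.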

CodePudding user response：

If df is your list of 6 dataframes, you can do this:

df <- df[sapply(df, \(i) "Z" %in% colnames(i))]

The reason you get the error is that your loop will reduce the length of df, such that i will eventually be beyond the (new) length of df. There will be no error if the only frame in df without column Z is the last frame.
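A slightly more defensive variant of the same idea, as a sketch: vapply() with logical(1) always returns a logical vector, whereas sapply() returns an empty list when its input is empty, which cannot be used as a subscript. The names frames, has_z and kept below are made up for illustration.

```r
# Hypothetical example data: only the first data frame has column "Z".
frames <- list(
  data.frame(A = 1:2, Z = 3:4),
  data.frame(A = 1:2, B = 3:4)
)

# One TRUE/FALSE per data frame; vapply() enforces a length-1 logical result.
has_z <- vapply(frames, function(x) "Z" %in% colnames(x), logical(1))
kept <- frames[has_z]

# The same code also works on an empty list, giving back an empty list.
empty <- list()
kept_empty <- empty[vapply(empty, function(x) "Z" %in% colnames(x), logical(1))]
```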

CodePudding user response：

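Using purrr. Note that discard() drops the elements for which the predicate is TRUE, so to remove the data frames without column Z, either negate the test inside discard() or use its counterpart keep():

list_df <- list(df1, df2)
purrr::keep(list_df, ~ "Z" %in% colnames(.x))

Output:

[[1]]
  A Z
1 1 1
2 2 4

As you can see, it kept the first data frame, which has column Z, and removed the second, which doesn't.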

data

df1 <- data.frame(A = c(1,2),
                  Z = c(1,4))

df2 <- data.frame(A = c(1,3),
                  B = c(3,4))