Home > database >  Remove columns of dataframe based on column names of itself and following columns in R
Remove columns of dataframe based on column names of itself and following columns in R

Time:09-23

I have a dataframe consisting of 271 columns. Each column is named as "sampleName_time".

 [1] "A01_0"  "A01_24" "A01_48" "A01_72" "A02_48" "A03_0"  "A03_24" "A03_48" "A04_24" "A04_48" "A05_0"  "A05_24" "A05_48"
 [14] "A05_72" "A06_0"  "A06_48" "A07_0"  "A07_24" "A07_48" "A08_0"  "A08_24" "A08_48" "A08_72" "A09_24" "A09_48" "A09_72"
 [27] "A10_0"  "A10_24" "A10_48" "A11_0"  "A11_48" "A11_72" "A12_48" "B01_0"  "B01_24" "B01_48" "B01_72" "B02_24" "B02_48"
...

The time can be 0, 24, 48 or 72. I want to keep the columns that refer to the same sample and have measurements on all time points (0, 24, 48 and 72). For example, The sample A01 is ok because we have columns "A01_0", "A01_24", "A01_48" and "A01_72". A02 is not ok because there is only one column named "A02_48" but no others. A03 is not ok too. But A05 is ok. So for the above example, the pruned dataframe I want would be as follows

"A01_0"  "A01_24" "A01_48" "A01_72" "A05_0"  "A05_24" "A05_48"
"A05_72" "A08_0"  "A08_24" "A08_48" "A08_72" "B01_0"  "B01_24" "B01_48" "B01_72" 
...

The following is my approach. But it seems so complicated as I need to have the nested if statements and maybe I need to use next to skip some iterations of the for loop.

for (i in seq_along(colnames(exprs))){
  if (tr_split(colnames(exprs)[i], "_")[1][2] == 0){
    if (tr_split(colnames(exprs)[i 1], "_")[1][2] == 24){
      # not complete with more if statements
    }
  }
}

Are there some neat ways to do that? I think there may be some neat way using sapply(colnames(exprs),function(x){ someFunction}) Thanks

CodePudding user response:

Consider this mock data set that reproduces your column names.

df <- data.frame(replicate(8, 1:10))
colnames(df) <- c("A01_0", "A01_24", "A01_48", "A01_72", "A02_48", "A03_0",
                  "A03_24", "A03_48")

#    A01_0 A01_24 A01_48 A01_72 A02_48 A03_0 A03_24 A03_48
# 1      1      1      1      1      1     1      1      1
# 2      2      2      2      2      2     2      2      2
# 3      3      3      3      3      3     3      3      3
# 4      4      4      4      4      4     4      4      4
# 5      5      5      5      5      5     5      5      5
# 6      6      6      6      6      6     6      6      6
# 7      7      7      7      7      7     7      7      7
# 8      8      8      8      8      8     8      8      8
# 9      9      9      9      9      9     9      9      9
# 10    10     10     10     10     10    10     10     10

What you can do is keep only groups of columns where you have 4 different values (0, 24, 48, 72):

library(dplyr)
un <- data.frame(col = colnames(df)) %>% 
  group_by(gp = substr(col, 1, 3)) %>% 
  filter(n() == 4) %>% 
  pull(gp) %>% 
  unique()

# In this case, we have
un
#[1] "A01"

Then, use that vector get columns that starts_with/contains one of the values in un:

df %>% 
  select(starts_with(un))

#    A01_0 A01_24 A01_48 A01_72
# 1      1      1      1      1
# 2      2      2      2      2
# 3      3      3      3      3
# 4      4      4      4      4
# 5      5      5      5      5
# 6      6      6      6      6
# 7      7      7      7      7
# 8      8      8      8      8
# 9      9      9      9      9
# 10    10     10     10     10

CodePudding user response:

Another approach:

df <- data.frame(A01_0 = 1,
                 A01_24 = 2,
                 A01_48 = 3,
                 A02_48 = 1, 
                 A05_0 = 1,
                 A05_24 = 1)

df

A01_0 A01_24 A01_48 A02_48 A05_0 A05_24
1     1      2      3      1     1      1

df[substr(names(df), 1, 3) %in% names(which(table(gsub('_\\d ', '', names(df)) ) > 1))]
  A01_0 A01_24 A01_48 A05_0 A05_24
1     1      2      3     1      1

Breaking down the steps:

gsub('_\\d ', '', names(df))
[1] "A01" "A01" "A01" "A02" "A05" "A05"
names(table(gsub('_\\d ', '', names(df)) ) > 1)
[1] "A01" "A02" "A05"
which(table(gsub('_\\d ', '', names(df)) ) > 1)
A01 A05 
  1   3 
names(which(table(gsub('_\\d ', '', names(df)) ) > 1))
[1] "A01" "A05"
  • Related