Pasting string back to column names after removing it with gsub

I need help pasting a string back into the column names after removing it with gsub. I removed it to see how many duplicate samples there are and to keep only the paired (duplicate) samples.

For instance:

        A_111_apples_red B_111_apples_red A_11_bananas_yellow B_11_bananas_yellow
Store.A                5                6                   7                   8
Store.B                6                5                   4                   3
Store.C                3                4                   5                   6

I used a code such as:

sample <- colnames(data)
sample <- gsub("A_|B_", "", sample)
sample <- sample[duplicated(sample)]

Now that I know how many duplicate samples there are and want to keep them, how can I paste "A_" and "B_" back into the column names and create a new data frame with just the duplicated samples (columns)?

CodePudding user response：

You were close. Assuming "B__" is a typo and "B_" is meant, subset the data with the logical output of duplicated(). I use different object names so as not to mask the existing functions sample() and data().

(dupes <- duplicated(gsub("A_|B_", "", colnames(dat))))
# [1] FALSE  TRUE FALSE  TRUE

(dupes <- duplicated(substring(names(dat), 3)))  ## alternatively, drop the two-character prefix with `substring()`
# [1] FALSE  TRUE FALSE  TRUE

dat_dupes <- dat[dupes]
dat_dupes
#         B_111_apples_red B_11_bananas_yellow
# Store.A                6                   8
# Store.B                5                   3
# Store.C                4                   6

dat_wo_dupes <- dat[!dupes]
dat_wo_dupes
#         A_111_apples_red A_11_bananas_yellow
# Store.A                5                   7
# Store.B                6                   4
# Store.C                3                   5

Or, split the columns into a list like so:

stems <- gsub("A_|B_", "", colnames(dat))  ## column names without the prefix
dat_lst <- lapply(unique(stems), \(x) dat[stems == x])
dat_lst
# [[1]]
#         A_111_apples_red B_111_apples_red
# Store.A                5                6
# Store.B                6                5
# Store.C                3                4
#
# [[2]]
#         A_11_bananas_yellow B_11_bananas_yellow
# Store.A                   7                   8
# Store.B                   4                   3
# Store.C                   5                   6

Note: The `\(x)` shorthand for anonymous functions requires R >= 4.1.

Data:

dat <- structure(list(A_111_apples_red = c(5L, 6L, 3L), B_111_apples_red = 6:4, 
    A_11_bananas_yellow = c(7L, 4L, 5L), B_11_bananas_yellow = c(8L, 
    3L, 6L)), row.names = c("Store.A", "Store.B", "Store.C"), class = "data.frame")
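
The question asks for both members of each pair, whereas `dat[dupes]` keeps only the second occurrence. Below is a minimal sketch of one way to keep every column whose stripped name occurs more than once. It assumes the prefixes are exactly `A_` or `B_` at the start of the name. The names `stems`, `paired` and `dat_paired` are introduced here for illustration, and the `A_12_kiwis_green` column is made up to show an unpaired sample being dropped.

```r
# Example data from above plus one hypothetical unpaired column
dat <- data.frame(
  A_111_apples_red    = c(5L, 6L, 3L),
  B_111_apples_red    = 6:4,
  A_11_bananas_yellow = c(7L, 4L, 5L),
  B_11_bananas_yellow = c(8L, 3L, 6L),
  A_12_kiwis_green    = 1:3,  # hypothetical: has no B_ partner
  row.names = c("Store.A", "Store.B", "Store.C")
)

# Strip the leading prefix only; "^" anchors the match to the start of the name
stems <- sub("^[AB]_", "", names(dat))

# TRUE for every column whose stem occurs more than once (the A_ and the B_ column)
paired <- stems %in% stems[duplicated(stems)]

# The original names were never modified, so nothing has to be pasted back
dat_paired <- dat[paired]
names(dat_paired)
```

Because the prefix is only removed in a temporary vector, the selected columns keep their full `A_`/`B_` names.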

CodePudding user response：

Update to clarify:

I assume you want to check whether you have duplicated column names independent of the prefixes A_ and B_.

If I am correct, then the string 111_apples_red is the only one that is duplicated.

So now, having 111_apples_red, you want to add the prefixes again. You can do this by detecting the string 111_apples_red in the column names; this indirectly adds the prefixes A_ and B_ back.

library(dplyr)
data %>% 
  select(contains(sample))

        A_111_apples_red B_111_apples_red
Store A                5                6
Store B                6                5
Store C                3                4

If B__11_bananas_yellow is a typo, as stated by jay.sf, and should be B_11_bananas_yellow, then with the same code:

data %>% 
  select(contains(sample))

We get:

        A_111_apples_red B_111_apples_red A_11_bananas_yellow B_11_bananas_yellow
Store A                5                6                   7                   8
Store B                6                5                   4                   3
Store C                3                4                   5                   6
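
Note that `contains()` matches literal substrings, so a sample named, say, `11_apples_red` would also select `A_111_apples_red`. If exact names are preferred, the prefixes can be pasted back explicitly. Below is a sketch in base R, assuming every duplicated sample is expected to have both an `A_` and a `B_` column. The names `dup_stems`, `wanted` and `dat_dupes_both` are illustrative only.

```r
dat <- data.frame(
  A_111_apples_red    = c(5L, 6L, 3L),
  B_111_apples_red    = 6:4,
  A_11_bananas_yellow = c(7L, 4L, 5L),
  B_11_bananas_yellow = c(8L, 3L, 6L),
  row.names = c("Store.A", "Store.B", "Store.C")
)

# Same steps as in the question, with object names that mask nothing
stems     <- gsub("A_|B_", "", colnames(dat))
dup_stems <- stems[duplicated(stems)]

# Paste both prefixes back onto each duplicated sample name;
# outer() builds every prefix/stem combination, one stem per column
wanted <- as.vector(outer(c("A_", "B_"), dup_stems, paste0))

# intersect() keeps only the names that actually exist in dat
dat_dupes_both <- dat[intersect(wanted, names(dat))]
names(dat_dupes_both)
```

Selecting by full name avoids accidental partial matches, at the cost of hard-coding the two prefixes.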