I need help pasting a string back into the column names after removing it with gsub. I removed it to see how many duplicate samples there are, and I want to keep only the paired (duplicate) samples.
For instance:
        A_111_apples_red B_111_apples_red A_11_bananas_yellow B_11_bananas_yellow
Store.A                5                6                   7                   8
Store.B                6                5                   4                   3
Store.C                3                4                   5                   6
I used a code such as:
sample <- colnames(data)
sample <- gsub("A_|B_", "", sample)
sample <- sample[duplicated(sample)]
Now that I know how many duplicate samples there are, and want to keep the duplicate samples, how can I paste "A_" and "B_" back into the column names and create a new data frame with just the duplicate samples (columns)?
CodePudding user response:
You were close. Assuming "B__" is a typo and "B_" is meant, subset dat with the logical output of duplicated(). I use different object names to avoid clashing with existing function names (sample and data).
(dupes <- duplicated(gsub("A_|B_", "", colnames(dat))))
# [1] FALSE TRUE FALSE TRUE
(dupes <- duplicated(substring(names(dat), 3))) ## alternatively, drop the two-character prefix with `substring()`
# [1] FALSE TRUE FALSE TRUE
dat_dupes <- dat[dupes]
dat_dupes
#         B_111_apples_red B_11_bananas_yellow
# Store.A                6                   8
# Store.B                5                   3
# Store.C                4                   6
dat_wo_dupes <- dat[!dupes]
dat_wo_dupes
#         A_111_apples_red A_11_bananas_yellow
# Store.A                5                   7
# Store.B                6                   4
# Store.C                3                   5
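If the goal is to keep both the A_ and the B_ column of every paired sample, rather than only the second occurrence, the same idea can be extended as in the sketch below. It is only an illustration: the object names dat2, stems, and paired are made up, the column A_22_pears_green is a hypothetical unpaired sample added for demonstration, and the anchored pattern "^[AB]_" assumes the prefix is always a single A or B followed by an underscore.

```r
# Example data from the question, plus a hypothetical unpaired column
# (A_22_pears_green) added only to show that unpaired columns are dropped.
dat2 <- data.frame(
  A_111_apples_red    = c(5L, 6L, 3L),
  B_111_apples_red    = 6:4,
  A_11_bananas_yellow = c(7L, 4L, 5L),
  B_11_bananas_yellow = c(8L, 3L, 6L),
  A_22_pears_green    = c(1L, 2L, 3L),
  row.names = c("Store.A", "Store.B", "Store.C")
)

# Strip only a leading "A_" or "B_" (anchored with ^), leaving the sample name.
stems <- sub("^[AB]_", "", names(dat2))

# A sample counts as paired when its stem occurs more than once.
paired <- stems %in% stems[duplicated(stems)]

# Logical column subsetting keeps the original, prefixed column names,
# so nothing has to be pasted back.
dat_paired <- dat2[paired]
names(dat_paired)
```

Because the columns are selected with a logical vector, the prefixes are never actually removed from the data frame; only the temporary vector stems loses them.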
Or, split the columns into a list like so:
stems <- gsub("A_|B_", "", names(dat)) ## sample names without the prefix
dat_lst <- lapply(unique(stems), \(x) dat[stems == x])
dat_lst
# [[1]]
#         A_111_apples_red B_111_apples_red
# Store.A                5                6
# Store.B                6                5
# Store.C                3                4
#
# [[2]]
#         A_11_bananas_yellow B_11_bananas_yellow
# Store.A                   7                   8
# Store.B                   4                   3
# Store.C                   5                   6
Note: R >= 4.1 used.
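A possible alternative for the list version is split.default(), which splits the columns of a data frame by a grouping vector and names each list element after its group. The sketch below assumes the same example data; stems and dat_by_sample are illustrative names. Elements are accessed by name because the order of the list follows the sorted group labels, which can depend on the locale.

```r
# Example data from the question.
dat <- data.frame(
  A_111_apples_red    = c(5L, 6L, 3L),
  B_111_apples_red    = 6:4,
  A_11_bananas_yellow = c(7L, 4L, 5L),
  B_11_bananas_yellow = c(8L, 3L, 6L),
  row.names = c("Store.A", "Store.B", "Store.C")
)

# Sample name of each column, without the leading "A_" or "B_".
stems <- sub("^[AB]_", "", names(dat))

# split.default() treats the data frame as a list of columns and
# groups those columns by stem; each element is itself a data frame.
dat_by_sample <- split.default(dat, stems)

# Access one pair by its sample name.
dat_by_sample[["11_bananas_yellow"]]
```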
Data:
dat <- structure(list(A_111_apples_red = c(5L, 6L, 3L), B_111_apples_red = 6:4,
A_11_bananas_yellow = c(7L, 4L, 5L), B_11_bananas_yellow = c(8L,
3L, 6L)), row.names = c("Store.A", "Store.B", "Store.C"), class = "data.frame")
CodePudding user response:
Update to clarify:
I assume you want to check whether you have duplicated column names independent of the prefixes A_ and B_.
If I am correct, then the string 111_apples_red is the only one that is duplicated.
So now, having 111_apples_red, you want to add the prefixes again.
You can do this by detecting the string 111_apples_red in the column names.
Indirectly, you add the prefixes A_ and B_ back.
library(dplyr)
data %>%
select(contains(sample))
        A_111_apples_red B_111_apples_red
Store A                5                6
Store B                6                5
Store C                3                4
If B__11_bananas_yellow is a typo, as stated by jay.sf, and should be B_11_bananas_yellow, then with the same code:
data %>%
select(contains(sample))
We get:
        A_111_apples_red B_111_apples_red A_11_bananas_yellow B_11_bananas_yellow
Store A                5                6                   7                   8
Store B                6                5                   4                   3
Store C                3                4                   5                   6
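The prefixes can also be pasted back literally and the columns matched by exact name, which avoids substring matching (a stem such as 11_bananas_yellow is also contained in a longer name like A_111_bananas_yellow). The following base R sketch assumes the example data with the B_ spelling and exactly the two prefixes A_ and B_; dat, stems, dup_stems, wanted, and dat_sel are illustrative names, not objects from the question.

```r
# Example data from the question.
dat <- data.frame(
  A_111_apples_red    = c(5L, 6L, 3L),
  B_111_apples_red    = 6:4,
  A_11_bananas_yellow = c(7L, 4L, 5L),
  B_11_bananas_yellow = c(8L, 3L, 6L),
  row.names = c("Store.A", "Store.B", "Store.C")
)

# Sample names without the prefix, and the ones that occur more than once.
stems <- sub("^[AB]_", "", names(dat))
dup_stems <- unique(stems[duplicated(stems)])

# Paste every prefix back onto every duplicated sample name.
wanted <- as.vector(outer(c("A_", "B_"), dup_stems, paste0))

# Exact matching on full column names; %in% keeps the original column order.
dat_sel <- dat[names(dat) %in% wanted]
names(dat_sel)
```

With dplyr, the same character vector could be passed to select(all_of(wanted)) or select(any_of(wanted)), although the columns would then come back in the order given by wanted.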