R Regex capture to remove/keep columns with repeats in their column names

This is an example dataframe

means2 <- as.data.frame(matrix(runif(n=25, min=1, max=20), nrow=5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")

I have column names in my dataframe in R in this format: \S+_T\d+|\S+_T\d+

The syntax is something like (Name)_(T)(Number)|(Name)_(T)(Number)

Step 1) I want to select the columns that have the same (T)(Number) on both sides of the "|". I did this with some manual labor:

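library(dplyr)   # %>%, select(), matches(), full_join()
library(tibble)  # rownames_to_column()

# one select() per timepoint: the same T<number> must appear on both sides of the "|"
means_t0 <- means2 %>% select(matches("\\S+_T0\\|\\S+_T0")) %>% rownames_to_column("id_cp_interaction")
means_t1 <- means2 %>% select(matches("\\S+_T1\\|\\S+_T1")) %>% rownames_to_column("id_cp_interaction")
means_t5 <- means2 %>% select(matches("\\S+_T5\\|\\S+_T5")) %>% rownames_to_column("id_cp_interaction")
means3 <- full_join(means_t0, means_t1, by = "id_cp_interaction") %>% full_join(means_t5, by = "id_cp_interaction")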

This gives me what I want, and it was easy to do because I only had 3 timepoints: T0, T1 and T5. What should I do if I had many more?

Step 2) From the output of Step 1, I want the negation of the previous selection applied to the names, i.e. select only those columns where the Names on the two sides are not the same. For example, B_T0|B_T0 should be removed but B_T0|Fibro_T0 should be retained.

Is there a way to regex-capture the part in front of the pipe (|) and match it against the part after the pipe?

Thank you

CodePudding user response：

When column names carry that much information, I like to reshape the data into long format and then split the information from each column name into several columns. Filtering by those columns is then easy:

means2 <- as.data.frame(matrix(runif(n=25, min=1, max=20), nrow=5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
means2 <- cbind(data.frame(id_cp_interaction = 1:5), means2)

library(tidyr)
library(dplyr)
library(stringr)

res <- means2 %>% 
  pivot_longer(
    cols = -id_cp_interaction,
    names_to = "names",
    values_to = "values"
  ) %>% 
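  mutate(
    # cell type before the first underscore
    celltype_1 = str_extract(names, "^[^_]*"),
    # digits directly before the pipe; the pipe must be escaped inside the lookahead
    timepoint_1 = str_extract(names, "[0-9]+(?=\\|)"),
    # cell type between the pipe and the next underscore
    celltype_2 = str_extract(names, "(?<=\\|)(.*?)(?=_)"),
    # digits at the end of the name; "+" also handles timepoints such as T10
    timepoint_2 = str_extract(names, "[0-9]+$")
  )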

head(res, n = 7)
#> # A tibble: 7 × 7
#>   id_cp_interaction names   values celltype_1 timepoint_1 celltype_2 timepoint_2
#>               <int> <chr>    <dbl> <chr>      <chr>       <chr>      <chr>      
#> 1                 1 B_T0|B…   1.68 B          0           B          0          
#> 2                 1 B_T0|B…  19.3  B          0           B          1          
#> 3                 1 B_T0|F…  10.6  B          0           Fibro      0          
#> 4                 1 B_T5|E…  12.5  B          5           Endo       5          
#> 5                 1 Macro_…   2.84 Macro      1           Fibro      1          
#> 6                 2 B_T0|B…   2.17 B          0           B          0          
#> 7                 2 B_T0|B…  10.1  B          0           B          1

# only keep interactions of different cell types
res %>% 
  filter(celltype_1 != celltype_2) %>% 
  head()
#> # A tibble: 6 × 7
#>   id_cp_interaction names   values celltype_1 timepoint_1 celltype_2 timepoint_2
#>               <int> <chr>    <dbl> <chr>      <chr>       <chr>      <chr>      
#> 1                 1 B_T0|F…  10.6  B          0           Fibro      0          
#> 2                 1 B_T5|E…  12.5  B          5           Endo       5          
#> 3                 1 Macro_…   2.84 Macro      1           Fibro      1          
#> 4                 2 B_T0|F…   1.47 B          0           Fibro      0          
#> 5                 2 B_T5|E…  11.3  B          5           Endo       5          
#> 6                 2 Macro_…  13.0  Macro      1           Fibro      1
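
Both steps from the question can be combined in the long format and the result reshaped back to one column per interaction. The following is only a sketch under a few assumptions: `set.seed(1)` and the object name `wide` are my own additions for illustration, and cell type names are assumed to contain no underscore.

```r
library(dplyr)
library(tidyr)
library(stringr)

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
means2 <- as.data.frame(matrix(runif(n = 25, min = 1, max = 20), nrow = 5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")
means2 <- cbind(data.frame(id_cp_interaction = 1:5), means2)

wide <- means2 %>%
  pivot_longer(cols = -id_cp_interaction, names_to = "names", values_to = "values") %>%
  mutate(
    celltype_1  = str_extract(names, "^[^_]*"),
    timepoint_1 = str_extract(names, "[0-9]+(?=\\|)"),
    celltype_2  = str_extract(names, "(?<=\\|)(.*?)(?=_)"),
    timepoint_2 = str_extract(names, "[0-9]+$")
  ) %>%
  # Step 1: same timepoint on both sides; Step 2: different cell types
  filter(timepoint_1 == timepoint_2, celltype_1 != celltype_2) %>%
  # drop the helper columns and go back to one column per interaction
  select(id_cp_interaction, names, values) %>%
  pivot_wider(names_from = names, values_from = values)

names(wide)
#> [1] "id_cp_interaction" "B_T0|Fibro_T0"     "B_T5|Endo_T5"      "Macro_T1|Fibro_T1"
```

Because the timepoints are compared as extracted strings, nothing has to be listed by hand, so this should scale to any number of timepoints.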

Created on 2022-09-19 by the reprex package (v1.0.0)
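
To address the literal question about capturing one side of the pipe and matching it on the other: a regex backreference (`\\1`) can do this directly on the column names, without reshaping. This is a rough sketch in base R, not necessarily the better approach. The names `cols`, `same_time`, `same_name`, `keep` and `result` are made up for illustration, and the patterns assume that neither the cell type names nor the timepoints contain an underscore or a pipe.

```r
means2 <- as.data.frame(matrix(runif(n = 25, min = 1, max = 20), nrow = 5))
names(means2) <- c("B_T0|B_T0", "B_T0|B_T1", "B_T0|Fibro_T0", "B_T5|Endo_T5", "Macro_T1|Fibro_T1")

cols <- names(means2)

# Step 1: capture the timepoint left of the pipe and require the same text
# (\\1) at the end of the name
same_time <- grepl("^[^_|]+_(T\\d+)\\|[^_|]+_\\1$", cols, perl = TRUE)

# Step 2: capture the name left of the pipe and require the same name (\\1)
# right after the pipe
same_name <- grepl("^([^_|]+)_T\\d+\\|\\1_T\\d+$", cols, perl = TRUE)

# keep columns with matching timepoints but different names
keep <- cols[same_time & !same_name]
result <- means2[, keep, drop = FALSE]

keep
#> [1] "B_T0|Fibro_T0"     "B_T5|Endo_T5"      "Macro_T1|Fibro_T1"
```

The `^` and `$` anchors matter here: without them, T1 on one side could match the leading part of T10 on the other.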