Home > Net >  Combine same header columns in R data frame
Combine same header columns in R data frame

Time:08-09

My input data frame contains more than 100 columns and rows. I want to combine the columns if their header is the same.

Following is my input data frame

Case.ID HRAS    TP53    MAP3K1  MAP3K1  TP53
TCGA_1          MSE;                    MSE;
TCGA_2  MUT;                  
TCGA_3                    
TCGA_4          MUT;                    AMP;
TCGA_5                  MSE;          
TCGA_6                    
TCGA_7  MUT;                  
TCGA_8  MUT;    AMP;              
TCGA_9  MUT;                  
TCGA_10                   
TCGA_11                 FRM;    st_gai;   
TCGA_12                                 HDEL;
          

Expected output

Case.ID HRAS    TP53      MAP3K1
TCGA_1          MSE;      
TCGA_2  MUT;          
TCGA_3            
TCGA_4          MUT;AMP;      
TCGA_5                    MSE;
TCGA_6            
TCGA_7  MUT;          
TCGA_8  MUT;    AMP;      
TCGA_9  MUT;          
TCGA_10           
TCGA_11                  FRM;st_gai;
TCGA_12         HDEL;     

In the expected output, you can see I have combined the same header columns in such a way that if they have the same entry present in the row then it will print only once and if different entries are present in a row then all entries will be taken together. Here they just combine the selected column Combine two or more columns in a dataframe into a new column with a new name

CodePudding user response:

You can try putting it into long format then after adjusting the names with str_remove we pivot wider, using the values_fn we paste the values.

librray(tidyr)

cbind(Case.ID = rownames(df), stack(df[-1])) |> 
  transform(ind = as.character(ind) |> stringr::str_remove_all("\\.\\d")) |> 
  pivot_wider(Case.ID, names_from = ind, values_from = values, 
              values_fn = \(x) paste(unique(x), collapse = ""))

This assumes that the Case.Id was rownames and not a column becasue thats how it appeared when you gave your data. If Case.Id is originally a column then you do

cbind(df[-1], stack(df[-1]) instead of cbind(Case.ID = rownames(df), stack(df[-1]))

  Case.ID HRAS   TP53       MAP3K1       
   <chr>   <chr>  <chr>      <chr>        
 1 TCGA_1  ""     "MSE;"     ""           
 2 TCGA_2  ""     ""         ""           
 3 TCGA_3  ""     ""         ""           
 4 TCGA_4  ""     "MUT;AMP;" ""           
 5 TCGA_5  ""     ""         "MSE;"       
 6 TCGA_6  ""     ""         ""           
 7 TCGA_7  "MUT;" ""         ""           
 8 TCGA_8  "MUT;" "AMP;"     ""           
 9 TCGA_9  "MUT;" ""         ""           
10 TCGA_10 ""     ""         ""           
11 TCGA_11 ""     ""         "FRM;st_gai;"
12 TCGA_12 ""     "HDEL;"    ""             
  • Related