My input data frame contains more than 100 columns and rows. I want to combine the columns if their header is the same.
Following is my input data frame
Case.ID HRAS TP53 MAP3K1 MAP3K1 TP53
TCGA_1 MSE; MSE;
TCGA_2 MUT;
TCGA_3
TCGA_4 MUT; AMP;
TCGA_5 MSE;
TCGA_6
TCGA_7 MUT;
TCGA_8 MUT; AMP;
TCGA_9 MUT;
TCGA_10
TCGA_11 FRM; st_gai;
TCGA_12 HDEL;
Expected output
Case.ID HRAS TP53 MAP3K1
TCGA_1 MSE;
TCGA_2 MUT;
TCGA_3
TCGA_4 MUT;AMP;
TCGA_5 MSE;
TCGA_6
TCGA_7 MUT;
TCGA_8 MUT; AMP;
TCGA_9 MUT;
TCGA_10
TCGA_11 FRM;st_gai;
TCGA_12 HDEL;
In the expected output, you can see I have combined the same header columns in such a way that if they have the same entry present in the row then it will print only once and if different entries are present in a row then all entries will be taken together. Here they just combine the selected column Combine two or more columns in a dataframe into a new column with a new name
CodePudding user response:
You can try putting it into long format then after adjusting the names with str_remove
we pivot wider, using the values_fn
we paste the values.
librray(tidyr)
cbind(Case.ID = rownames(df), stack(df[-1])) |>
transform(ind = as.character(ind) |> stringr::str_remove_all("\\.\\d")) |>
pivot_wider(Case.ID, names_from = ind, values_from = values,
values_fn = \(x) paste(unique(x), collapse = ""))
This assumes that the Case.Id was rownames and not a column becasue thats how it appeared when you gave your data. If Case.Id is originally a column then you do
cbind(df[-1], stack(df[-1])
instead of cbind(Case.ID = rownames(df), stack(df[-1]))
Case.ID HRAS TP53 MAP3K1
<chr> <chr> <chr> <chr>
1 TCGA_1 "" "MSE;" ""
2 TCGA_2 "" "" ""
3 TCGA_3 "" "" ""
4 TCGA_4 "" "MUT;AMP;" ""
5 TCGA_5 "" "" "MSE;"
6 TCGA_6 "" "" ""
7 TCGA_7 "MUT;" "" ""
8 TCGA_8 "MUT;" "AMP;" ""
9 TCGA_9 "MUT;" "" ""
10 TCGA_10 "" "" ""
11 TCGA_11 "" "" "FRM;st_gai;"
12 TCGA_12 "" "HDEL;" ""