I want to build one large table from data that comes from several sources, but some of the names are identical, so after merging it is not possible to tell which source a row came from.
I have a solution in mind, but I don't know if it's possible to achieve.
Here is a part of my data:
name id sym
ENSG00000135821 2752 GLUL
ENSG00000135821 2752 GLUL
ENSG00000135821 2752 GLUL
ENSG00000135821 2752 GLUL
As you can see, I cannot tell where each row came from. My idea is to modify the values in the name column of each separate dataframe before merging them, so that the merged df looks like this:
name id sym
ENSG00000135821_sample1 2752 GLUL
ENSG00000135821_sample2 2752 GLUL
ENSG00000135821_sample3 2752 GLUL
ENSG00000135821_sample4 2752 GLUL
Is it possible to append a suffix to every value in a df column while keeping the original name as the prefix?
For a single df I would like to get:
name id sym
ENSG00000135821_sample1 2752 GLUL
ENSG00000182667_sample1 50863 NTM
ENSG00000155495_sample1 9947 MAGEC1
ENSG00000198959_sample1 7052 TGM2
Thank you!
CodePudding user response:
A dplyr solution. Group by id and sym and use seq_along to get the consecutive numbers.
df1 <- 'name id sym
ENSG00000135821 2752 GLUL
ENSG00000135821 2752 GLUL
ENSG00000135821 2752 GLUL
ENSG00000135821 2752 GLUL'
df1 <- read.table(textConnection(df1), header = TRUE)
df2 <-"name id sym
ENSG00000135821 2752 GLUL
ENSG00000182667 50863 NTM
ENSG00000155495 9947 MAGEC1
ENSG00000198959 7052 TGM2"
df2 <- read.table(textConnection(df2), header = TRUE)
suppressPackageStartupMessages(
library(dplyr)
)
df1 %>%
  group_by(id, sym) %>%
  mutate(name = paste0(name, "_sample", seq_along(name))) %>%
  ungroup()
#> # A tibble: 4 × 3
#> name id sym
#> <chr> <int> <chr>
#> 1 ENSG00000135821_sample1 2752 GLUL
#> 2 ENSG00000135821_sample2 2752 GLUL
#> 3 ENSG00000135821_sample3 2752 GLUL
#> 4 ENSG00000135821_sample4 2752 GLUL
Created on 2022-10-14 with reprex v2.0.2
This can be written as a function and applied to any data set as long as the column names are the same: name, id and sym.
newname <- function(x) {
  x %>%
    group_by(id, sym) %>%
    mutate(name = paste0(name, "_sample", seq_along(name))) %>%
    ungroup()
}
newname(df1)
#> # A tibble: 4 × 3
#> name id sym
#> <chr> <int> <chr>
#> 1 ENSG00000135821_sample1 2752 GLUL
#> 2 ENSG00000135821_sample2 2752 GLUL
#> 3 ENSG00000135821_sample3 2752 GLUL
#> 4 ENSG00000135821_sample4 2752 GLUL
newname(df2)
#> # A tibble: 4 × 3
#> name id sym
#> <chr> <int> <chr>
#> 1 ENSG00000135821_sample1 2752 GLUL
#> 2 ENSG00000182667_sample1 50863 NTM
#> 3 ENSG00000155495_sample1 9947 MAGEC1
#> 4 ENSG00000198959_sample1 7052 TGM2
Created on 2022-10-14 with reprex v2.0.2
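If the suffix should identify the data frame a row came from, rather than count repeated rows inside one data frame, a minimal base R sketch could look like the following. The helper name add_source_suffix and its label argument are hypothetical and not part of the answer above, and the two small data frames are made up for illustration; the only assumption about the real data is that each data frame has a name column.

```r
# Hypothetical helper: append a source label to every value in the name column.
add_source_suffix <- function(df, label) {
  stopifnot("name" %in% names(df))
  df$name <- paste0(df$name, "_", label)
  df
}

# Two illustrative sources that share the same names.
df_a <- data.frame(
  name = c("ENSG00000135821", "ENSG00000182667"),
  id   = c(2752L, 50863L),
  sym  = c("GLUL", "NTM")
)
df_b <- df_a

# Tag each data frame first, then stack them.
merged <- rbind(
  add_source_suffix(df_a, "sample1"),
  add_source_suffix(df_b, "sample2")
)
merged$name
```

Because the label is passed in explicitly, the same name in two sources ends up with two different suffixes after merging.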
CodePudding user response:
Here is another option. Put all the dataframes in a list, give the names in each dataframe a suffix based on that dataframe's position in the list, then combine the renamed dataframes:
library(tidyverse)
#example data
df3 <- df2 <- df1 <-read_table("name id sym
ENSG00000135821 2752 GLUL
ENSG00000182667 50863 NTM
ENSG00000155495 9947 MAGEC1
ENSG00000198959 7052 TGM2")
list(df1, df2, df3) |>
  (\(l) map2_dfr(l, seq_along(l), \(df, num) {
    # num is the position of df in the list and becomes the sample number
    mutate(df, name = glue::glue("{name}_sample{num}"))
  }))() |>
  arrange(name, id)
#> # A tibble: 12 x 3
#> name id sym
#> <glue> <dbl> <chr>
#> 1 ENSG00000135821_sample1 2752 GLUL
#> 2 ENSG00000135821_sample2 2752 GLUL
#> 3 ENSG00000135821_sample3 2752 GLUL
#> 4 ENSG00000155495_sample1 9947 MAGEC1
#> 5 ENSG00000155495_sample2 9947 MAGEC1
#> 6 ENSG00000155495_sample3 9947 MAGEC1
#> 7 ENSG00000182667_sample1 50863 NTM
#> 8 ENSG00000182667_sample2 50863 NTM
#> 9 ENSG00000182667_sample3 50863 NTM
#> 10 ENSG00000198959_sample1 7052 TGM2
#> 11 ENSG00000198959_sample2 7052 TGM2
#> 12 ENSG00000198959_sample3 7052 TGM2
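A possible variation, in case the original name should stay untouched: dplyr::bind_rows() accepts an .id argument that records which list element each row came from. The sketch below assumes a named list; the element names sample1 to sample3, the sample column and the name_tagged column are made up for illustration, and the example data is a shortened version of the table above.

```r
library(dplyr)

# Illustrative data: three sources with identical names.
df1 <- data.frame(
  name = c("ENSG00000135821", "ENSG00000182667"),
  id   = c(2752L, 50863L),
  sym  = c("GLUL", "NTM")
)
df2 <- df1
df3 <- df1

# The list names become the values of the new "sample" column.
combined <- bind_rows(
  list(sample1 = df1, sample2 = df2, sample3 = df3),
  .id = "sample"
) |>
  # Build the tagged name in a separate column so name itself is unchanged.
  mutate(name_tagged = paste(name, sample, sep = "_")) |>
  arrange(name, sample)

combined
```

Keeping the source in its own column means the original name can still be used for joins or lookups, while name_tagged gives a unique label per source.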