Reorder rows in a dataframe based on column in another dataframe but with non-unique values-CodePudding

I have 2 dataframes.

The first dataframe, df1, has 1246 rows and looks like this:

                 gene1         gene2          gene3
AAAB.P1           1.23          2.28          -2.85
AABC.P1           4.06         -0.59          -2.42
ABCD.P1           3.26          2.19          -3.01
ABCD.R1           1.23          0.15          -2.30
DBCA.P1           1.67         -0.51          -3.24

The second dataframe, df2, also has 1246 rows and looks like this:

    id      primary_diagnosis
1   ABCD    carcinoma
2   ABCD    carcinoma
3   AAAB    AS
4   DBCA    carcinoma
5   EFGH    other
6   LMNO    AS

I need to rearrange the second df so that it is in the same order as the first df, based on the id column. I tried to use

df2<-df2[order(match(df2$id, substr(rownames(df1), 1, 4))), ]

But it only gave me 21 rows that were in the correct position. I think the problem is that df2 contains some duplicates. For example, ABCD has both ABCD.P1 and ABCD.R1 in df1 but only one entry in df2. It doesn't matter which order they go in because both P1 and R1 will match up with carcinoma. How can I reorder df2 to match whatever order df1 is already in?

CodePudding user response：

We could use simple arrange in this way:

library(dplyr)

df2 %>% 
  arrange(id, df1$id)

    id primary_diagnosis
3 AAAB                AS
1 ABCD         carcinoma
2 ABCD         carcinoma
4 DBCA         carcinoma
5 EFGH             other
6 LMNO                AS

#df1
df1 <- structure(list(gene1 = c(1.23, 4.06, 3.26, 1.23, 1.67), gene2 = c(2.28, 
-0.59, 2.19, 0.15, -0.51), gene3 = c(-2.85, -2.42, -3.01, -2.3, 
-3.24)), class = "data.frame", row.names = c("AAAB.P1", "AABC.P1", 
"ABCD.P1", "ABCD.R1", "DBCA.P1"))

#df2
df2 <- structure(list(id = c("ABCD", "ABCD", "AAAB", "DBCA", "EFGH", 
"LMNO"), primary_diagnosis = c("carcinoma", "carcinoma", "AS", 
"carcinoma", "other", "AS")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

CodePudding user response：

How about this:

df1 <- read.table(text = "id gene1 gene2 gene3
AAAB.P1 1.23 2.28 -2.85
AABC.P1 4.06 -0.59 -2.42
ABCD.P1 3.26 2.19 -3.01
ABCD.R1 1.23 0.15 -2.30
DBCA.P1 1.67 -0.51 -3.24", sep=" ", header=TRUE)
rownames(df1) <- df1$id
df1 <- df1[,-1]


df2 <- read.table(text="id primary_diagnosis
ABCD carcinoma
ABCD carcinoma
AAAB AS
DBCA carcinoma
EFGH other
LMNO AS", sep=" ", header=TRUE)


df2[match(substr(rownames(df1), 1, 4), df2$id), ]
#>       id primary_diagnosis
#> 3   AAAB                AS
#> NA  <NA>              <NA>
#> 1   ABCD         carcinoma
#> 1.1 ABCD         carcinoma
#> 4   DBCA         carcinoma

^{Created on 2022-05-23 by the reprex package (v2.0.1)}