Home > other >  Make duplicate rows as replicates in R dataframe
Make duplicate rows as replicates in R dataframe

Time:11-08

I have a data frame with duplicated rows having one continuous variable column and 2-factor columns (0,1). The goal is to find the duplicated rows and identify them as replicates in a new column.

Here is the structure of the data frame

  cont.var fact1 fact2
1      1.0     1     0
2      1.0     0     1
3      1.5     1     0
4      1.5     1     0
5      1.5     0     1
6      1.5     0     1

Now let's say

  • If cont.var has value 1.0 in two rows but has different values for fact1 and fact2, so it will be assigned two different replicates.

  • If cont.var has value 1.5 and fact1/fact2 is also the same for successive rows, they will be given the same replicate identifier.

Expected Output

   cont.var fact1 fact2 rep
1       1.0     1     0   1
2       1.0     0     1   2
3       1.5     1     0   3
4       1.5     1     0   3
5       1.5     0     1   4
6       1.5     0     1   4

What I have tried

library(dplyr)

sample.df <- data.frame(
  cont.var = c(1,1,1.5,1.5,1.5,1.5,2,2,2,3),
  fact1 = c(1,0,1,1,0,0,1,1,0,1),
  fact2 = c(0,1,0,0,1,1,0,0,1,0)
)
  
sample.df %>% 
    group_by(cont.var, fact1, fact2) %>% 
    mutate(replicate = make.unique(as.character(cont.var), "_"))

Incorrect Output

  • I would expect that row-1 and row-2 will have different replicate counts.
  • I would expect that Replicate count for row-3 == row-4 and row-5 == row-6, but row-5 != row-3
   cont.var fact1 fact2 replicate
1       1.0     1     0         1
2       1.0     0     1         1
3       1.5     1     0       1.5
4       1.5     1     0     1.5_1
5       1.5     0     1       1.5
6       1.5     0     1     1.5_1

I couldn't find a straightforward solution to this; I would really appreciate any help.

Thanks in advance.

CodePudding user response:

You can use data.table::rleid:

library(dplyr)
df %>% 
  mutate(rleid = data.table::rleid(cont.var, fact1, fact2))

  cont.var fact1 fact2 rleid
1      1.0     1     0     1
2      1.0     0     1     2
3      1.5     1     0     3
4      1.5     1     0     3
5      1.5     0     1     4
6      1.5     0     1     4

If you have dplyr's dev. version, you can also use consecutive_id, the dplyr version of data.table::rleid:

#devtools::install_github("tidyverse/dplyr")
library(dplyr)
df %>% 
  mutate(rleid2 = consecutive_id(cont.var, fact1, fact2))

Finally, a base R option would be to match the rows by unique values:

df$rleid <- match(do.call(paste, df), do.call(paste, unique(df)))

CodePudding user response:

Another dplyr method, in case you're already grouped:

quux %>%
  group_by(cont.var, fact1, fact2) %>%
  mutate(rep = group_indices()) %>%
  ungroup()
# # A tibble: 6 x 4
#   cont.var fact1 fact2   rep
#      <dbl> <int> <int> <int>
# 1      1       1     0     2
# 2      1       0     1     1
# 3      1.5     1     0     4
# 4      1.5     1     0     4
# 5      1.5     0     1     3
# 6      1.5     0     1     3

While the actual values are not the same, the spirit of your request is retained.

CodePudding user response:

Here is another base R solution:

sample.df <- data.frame(
  cont.var = c(1,1,1.5,1.5,1.5,1.5,2,2,2,3),
  fact1 = c(1,0,1,1,0,0,1,1,0,1),
  fact2 = c(0,1,0,0,1,1,0,0,1,0)
)


sample.df$replicate <- cumsum(!duplicated(sample.df)) 
sample.df
#>    cont.var fact1 fact2 replicate
#> 1       1.0     1     0         1
#> 2       1.0     0     1         2
#> 3       1.5     1     0         3
#> 4       1.5     1     0         3
#> 5       1.5     0     1         4
#> 6       1.5     0     1         4
#> 7       2.0     1     0         5
#> 8       2.0     1     0         5
#> 9       2.0     0     1         6
#> 10      3.0     1     0         7

EDIT ensure dups are continuous:

sample.df <- sample.df[with(sample.df, order(fact2,fact1,cont.var)),]
sample.df$replicate <- cumsum(!duplicated(sample.df)) 
sample.df
#>    cont.var fact1 fact2 replicate
#> 1       1.0     1     0         1
#> 3       1.5     1     0         2
#> 4       1.5     1     0         2
#> 7       2.0     1     0         3
#> 8       2.0     1     0         3
#> 10      3.0     1     0         4
#> 2       1.0     0     1         5
#> 5       1.5     0     1         6
#> 6       1.5     0     1         6
#> 9       2.0     0     1         7
  • Related