I have a data frame with duplicated rows having one continuous variable column and 2-factor columns (0,1). The goal is to find the duplicated rows and identify them as replicates in a new column.
Here is the structure of the data frame
cont.var fact1 fact2
1 1.0 1 0
2 1.0 0 1
3 1.5 1 0
4 1.5 1 0
5 1.5 0 1
6 1.5 0 1
Now let's say
If
cont.var
has value 1.0 in two rows but has different values forfact1
andfact2
, so it will be assigned two different replicates.If
cont.var
has value 1.5 andfact1
/fact2
is also the same for successive rows, they will be given the same replicate identifier.
Expected Output
cont.var fact1 fact2 rep
1 1.0 1 0 1
2 1.0 0 1 2
3 1.5 1 0 3
4 1.5 1 0 3
5 1.5 0 1 4
6 1.5 0 1 4
What I have tried
library(dplyr)
sample.df <- data.frame(
cont.var = c(1,1,1.5,1.5,1.5,1.5,2,2,2,3),
fact1 = c(1,0,1,1,0,0,1,1,0,1),
fact2 = c(0,1,0,0,1,1,0,0,1,0)
)
sample.df %>%
group_by(cont.var, fact1, fact2) %>%
mutate(replicate = make.unique(as.character(cont.var), "_"))
Incorrect Output
- I would expect that
row-1
androw-2
will have different replicate counts. - I would expect that Replicate count for
row-3
==row-4
androw-5
==row-6
, butrow-5
!=row-3
cont.var fact1 fact2 replicate
1 1.0 1 0 1
2 1.0 0 1 1
3 1.5 1 0 1.5
4 1.5 1 0 1.5_1
5 1.5 0 1 1.5
6 1.5 0 1 1.5_1
I couldn't find a straightforward solution to this; I would really appreciate any help.
Thanks in advance.
CodePudding user response:
You can use data.table::rleid
:
library(dplyr)
df %>%
mutate(rleid = data.table::rleid(cont.var, fact1, fact2))
cont.var fact1 fact2 rleid
1 1.0 1 0 1
2 1.0 0 1 2
3 1.5 1 0 3
4 1.5 1 0 3
5 1.5 0 1 4
6 1.5 0 1 4
If you have dplyr's dev. version, you can also use consecutive_id
, the dplyr
version of data.table::rleid
:
#devtools::install_github("tidyverse/dplyr")
library(dplyr)
df %>%
mutate(rleid2 = consecutive_id(cont.var, fact1, fact2))
Finally, a base R
option would be to match
the rows by unique values:
df$rleid <- match(do.call(paste, df), do.call(paste, unique(df)))
CodePudding user response:
Another dplyr method, in case you're already grouped:
quux %>%
group_by(cont.var, fact1, fact2) %>%
mutate(rep = group_indices()) %>%
ungroup()
# # A tibble: 6 x 4
# cont.var fact1 fact2 rep
# <dbl> <int> <int> <int>
# 1 1 1 0 2
# 2 1 0 1 1
# 3 1.5 1 0 4
# 4 1.5 1 0 4
# 5 1.5 0 1 3
# 6 1.5 0 1 3
While the actual values are not the same, the spirit of your request is retained.
CodePudding user response:
Here is another base R solution:
sample.df <- data.frame(
cont.var = c(1,1,1.5,1.5,1.5,1.5,2,2,2,3),
fact1 = c(1,0,1,1,0,0,1,1,0,1),
fact2 = c(0,1,0,0,1,1,0,0,1,0)
)
sample.df$replicate <- cumsum(!duplicated(sample.df))
sample.df
#> cont.var fact1 fact2 replicate
#> 1 1.0 1 0 1
#> 2 1.0 0 1 2
#> 3 1.5 1 0 3
#> 4 1.5 1 0 3
#> 5 1.5 0 1 4
#> 6 1.5 0 1 4
#> 7 2.0 1 0 5
#> 8 2.0 1 0 5
#> 9 2.0 0 1 6
#> 10 3.0 1 0 7
EDIT ensure dups are continuous:
sample.df <- sample.df[with(sample.df, order(fact2,fact1,cont.var)),]
sample.df$replicate <- cumsum(!duplicated(sample.df))
sample.df
#> cont.var fact1 fact2 replicate
#> 1 1.0 1 0 1
#> 3 1.5 1 0 2
#> 4 1.5 1 0 2
#> 7 2.0 1 0 3
#> 8 2.0 1 0 3
#> 10 3.0 1 0 4
#> 2 1.0 0 1 5
#> 5 1.5 0 1 6
#> 6 1.5 0 1 6
#> 9 2.0 0 1 7