Home > Blockchain >  How to remove rows with single unique ID in R?
How to remove rows with single unique ID in R?

Time:05-13

I have the dataset as coded below. For a specific set of treatment pair, year, month, level, I have assigned a unique ID. Ideally, a compelete "set" has two rows corresponding to the same unique ID. If it does not, I want to eliminate those rows.

So here, all "sets" of two sets of unique ID, except that corresponding to ID 2. How do In my original dataset, I have thousands such rows. How can I scan to remove these singeltons?

tmt.pair <- c("A","A","A","B","B","B","B")
tmt <- c("1000 C","4000 C","1000 C","1000 C","4000 C","1000 C","4000 C")
year <- c("2021","2021","2021","2021","2021","2020","2020")
month <- c("A","A","A","J","J","O","O")
level <- c("Low","Low","Up","Low","Low","Low","Low")
site <- c(1,1,2,1,1,1,1)
val <- rnorm(7,5,1)

df <- data.frame(tmt.pair, year,month, level,tmt,val)

df$ID <- cumsum(!duplicated(df[1:4]))



  tmt.pair year month level tmt    val       ID
1        A 2021     A   Low 1000 C 4.789715  1
2        A 2021     A   Low 4000 C 6.451113  1
3        A 2021     A    Up 1000 C 4.281171  2
4        B 2021     J   Low 1000 C 5.176668  3
5        B 2021     J   Low 4000 C 6.384432  3
6        B 2020     O   Low 1000 C 4.833731  4
7        B 2020     O   Low 4000 C 3.274355  4

CodePudding user response:

Using base R you can do this:

tab=table(df$ID)
df[ifelse(tab[df$ID]==1, FALSE, TRUE),] 

Output:

  tmt.pair year month level    tmt      val ID
1        A 2021     A   Low 1000 C 5.156294  1
2        A 2021     A   Low 4000 C 4.395990  1
4        B 2021     J   Low 1000 C 5.714170  3
5        B 2021     J   Low 4000 C 6.075886  3
6        B 2020     O   Low 1000 C 7.249756  4
7        B 2020     O   Low 4000 C 5.197891  4

CodePudding user response:

Another option using data.table:

library(data.table)
  
setDT(df)[,if(.N > 1) .SD, by=ID]

Output

   ID tmt.pair year month level    tmt      val
1:  1        A 2021     A   Low 1000 C 4.424811
2:  1        A 2021     A   Low 4000 C 4.556058
3:  3        B 2021     J   Low 1000 C 4.396996
4:  3        B 2021     J   Low 4000 C 3.906065
5:  4        B 2020     O   Low 1000 C 5.714706
6:  4        B 2020     O   Low 4000 C 4.891188

Or with dplyr, where we only keep IDs that have more than 1 observation:

library(dplyr)

df %>%
  group_by(ID) %>%
  filter(n() > 1)
  • Related