I am trying to clean up my dataset by removing unneeded rows. Here is a sample of my data: the first image shows the dataset and the second shows what I am trying to achieve. I want to delete all rows that share the same ID, keeping only the topmost one.
CodePudding user response:
You can use group_by with a cumsum counter and then filter out all subsequent rows within each ID:
library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(counter = cumsum(!is.na(ID))) %>%  # 1, 2, 3, ... within each ID
  ungroup() %>%
  filter(counter == 1) %>%                  # keep only the first row per ID
  select(-counter)
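A self-contained sketch of this pipeline. The sample data here is borrowed from the answer below, since the question's own data is only shown as images:

```r
library(dplyr)

# Sample data assumed from the other answer's reproducible example
df <- data.frame(LOCATION = c("A", "B", "C", "D", "E", "F", "G"),
                 TIME = c("13:00", "13:20", "13:25", "13:32", "13:50", "13:53", "13:58"),
                 ID = c("2V51", "2Y89", "2Y89", "2Y89", "2T33", "2T33", "2U99"))

result <- df %>%
  group_by(ID) %>%
  mutate(counter = cumsum(!is.na(ID))) %>%  # running count within each ID group
  ungroup() %>%
  filter(counter == 1) %>%                  # keep only each group's first row
  select(-counter)
```

Because group_by() and filter() preserve the original row order, the result keeps rows A, B, E, and G, each the first appearance of its ID.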
CodePudding user response:
We could group by ID and filter on the minimum, i.e. earliest, TIME. (Since TIME is stored as zero-padded "HH:MM" strings, string comparison orders the times correctly.)
> library(dplyr)
> df <- data.frame(LOCATION = c("A", "B", "C", "D", "E", "F", "G"),
+                  TIME = c("13:00", "13:20", "13:25", "13:32", "13:50", "13:53", "13:58"),
+                  ID = c("2V51", "2Y89", "2Y89", "2Y89", "2T33", "2T33", "2U99"))
>
> df
LOCATION TIME ID
1 A 13:00 2V51
2 B 13:20 2Y89
3 C 13:25 2Y89
4 D 13:32 2Y89
5 E 13:50 2T33
6 F 13:53 2T33
7 G 13:58 2U99
>
> df <- df %>%
group_by(ID) %>%
filter(TIME == min(TIME))
>
> df
# A tibble: 4 x 3
# Groups: ID [4]
LOCATION TIME ID
<chr> <chr> <chr>
1 A 13:00 2V51
2 B 13:20 2Y89
3 E 13:50 2T33
4 G 13:58 2U99
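On dplyr 1.0 or newer, slice_min() expresses "earliest TIME per ID" more directly; a sketch under that version assumption, using the same df:

```r
library(dplyr)

df <- data.frame(LOCATION = c("A", "B", "C", "D", "E", "F", "G"),
                 TIME = c("13:00", "13:20", "13:25", "13:32", "13:50", "13:53", "13:58"),
                 ID = c("2V51", "2Y89", "2Y89", "2Y89", "2T33", "2T33", "2U99"))

result <- df %>%
  group_by(ID) %>%
  slice_min(TIME, n = 1, with_ties = FALSE) %>%  # earliest TIME in each group
  ungroup()
```

with_ties = FALSE guards against keeping two rows if they share the same minimum TIME; note that slice functions return the rows grouped in sorted ID order rather than the original row order.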
CodePudding user response:
The base function duplicated() can remove the duplicates: it returns FALSE for the first occurrence of each value and TRUE for every subsequent one, so negating it keeps only the first row per ID.
df <- df %>% filter(!duplicated(ID))
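Since duplicated() is base R, the same result can be had without dplyr at all; a minimal sketch with the sample data from the answer above:

```r
df <- data.frame(LOCATION = c("A", "B", "C", "D", "E", "F", "G"),
                 TIME = c("13:00", "13:20", "13:25", "13:32", "13:50", "13:53", "13:58"),
                 ID = c("2V51", "2Y89", "2Y89", "2Y89", "2T33", "2T33", "2U99"))

result <- df[!duplicated(df$ID), ]  # keep the first row of each ID
```

Note that duplicated() keeps the first occurrence of each ID anywhere in the data frame, not just within consecutive runs; for this data that is exactly the desired behaviour.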