I have a data frame but it has many overlapping rows. However, I cannot erase it because not all values are equal. data consists of 5 * 18000000 values.
date station_number station latitude longitude
0101 11428 hansung 0 127.5
0101 11428 hansung 0 127.7
0101 11374 bookmuseum 0 127.3
0101 11380 mokryeon 14 127.9
0101 11380 mokryeon 14 128.1
0101 11388 healthcent 86 126.1
I want to erase row number 2 and row number 5 and all the other rows that have same date, station_number and station name. Or I want to unify rows of same values anyway. I need your help.
CodePudding user response:
You could remove rows with duplicated
values across multiple columns like this:
df [!duplicated(df[c(1:3)]),]
#> date station_number station latitude longitude
#> 1 101 11428 hansung 0 127.5
#> 3 101 11374 bookmuseum 0 127.3
#> 4 101 11380 mokryeon 14 127.9
#> 6 101 11388 healthcent 86 126.1
library(dplyr)
df %>%
distinct(date, station_number, station, .keep_all = TRUE)
#> date station_number station latitude longitude
#> 1 101 11428 hansung 0 127.5
#> 2 101 11374 bookmuseum 0 127.3
#> 3 101 11380 mokryeon 14 127.9
#> 4 101 11388 healthcent 86 126.1
Created on 2022-12-10 with reprex v2.0.2
Data:
df <- read.table(text = 'date station_number station latitude longitude
0101 11428 hansung 0 127.5
0101 11428 hansung 0 127.7
0101 11374 bookmuseum 0 127.3
0101 11380 mokryeon 14 127.9
0101 11380 mokryeon 14 128.1
0101 11388 healthcent 86 126.1', header = TRUE)
CodePudding user response:
There is some arbitrariness in your deletion rule, since some rows are identical on some columns but nor others. I would recommend using groups_by(same_columns) %>% slice_*()
, where you can choose between slice_min()
, slice_head()
etc, which will make your rule explicit.
Using this, you will realize that the solution given by @Quinten, with distinct(date, station_number, station, .keep_all = TRUE)
, corresponds to using slice_head(n=1)
, that is take the first row of duplicates. But you could also choose, to take the last slice_tail(n=1)
, or the one where longitude
is highest (slice_max(longitude)
), etc.
library(dplyr, warn.conflicts = FALSE)
df <- read.table(text = 'date station_number station latitude longitude
0101 11428 hansung 0 127.5
0101 11428 hansung 0 127.7
0101 11374 bookmuseum 0 127.3
0101 11380 mokryeon 14 127.9
0101 11380 mokryeon 14 128.1
0101 11388 healthcent 86 126.1', header = TRUE)
df %>%
group_by(date, station_number, station) %>%
slice_min(longitude) %>%
ungroup()
#> # A tibble: 4 × 5
#> date station_number station latitude longitude
#> <int> <int> <chr> <int> <dbl>
#> 1 101 11374 bookmuseum 0 127.
#> 2 101 11380 mokryeon 14 128.
#> 3 101 11388 healthcent 86 126.
#> 4 101 11428 hansung 0 128.
Created on 2022-12-10 by the reprex package (v2.0.1)