how can I remove some rows that are partial duplicates in r?-CodePudding

I have a data frame but it has many overlapping rows. However, I cannot erase it because not all values are equal. data consists of 5 * 18000000 values.

date     station_number    station   latitude   longitude
0101      11428            hansung      0         127.5
0101      11428            hansung      0         127.7
0101      11374           bookmuseum    0         127.3
0101      11380            mokryeon     14        127.9
0101      11380            mokryeon     14        128.1
0101      11388            healthcent   86        126.1

I want to erase row number 2 and row number 5 and all the other rows that have same date, station_number and station name. Or I want to unify rows of same values anyway. I need your help.

CodePudding user response：

You could remove rows with duplicated values across multiple columns like this:

df [!duplicated(df[c(1:3)]),]
#>   date station_number    station latitude longitude
#> 1  101          11428    hansung        0     127.5
#> 3  101          11374 bookmuseum        0     127.3
#> 4  101          11380   mokryeon       14     127.9
#> 6  101          11388 healthcent       86     126.1

library(dplyr)
df %>%
  distinct(date, station_number, station, .keep_all = TRUE)
#>   date station_number    station latitude longitude
#> 1  101          11428    hansung        0     127.5
#> 2  101          11374 bookmuseum        0     127.3
#> 3  101          11380   mokryeon       14     127.9
#> 4  101          11388 healthcent       86     126.1

^{Created on 2022-12-10 with reprex v2.0.2}

Data:

df <- read.table(text = 'date     station_number    station   latitude   longitude
0101      11428            hansung      0         127.5
0101      11428            hansung      0         127.7
0101      11374           bookmuseum    0         127.3
0101      11380            mokryeon     14        127.9
0101      11380            mokryeon     14        128.1
0101      11388            healthcent   86        126.1', header = TRUE)

CodePudding user response：

There is some arbitrariness in your deletion rule, since some rows are identical on some columns but nor others. I would recommend using groups_by(same_columns) %>% slice_*(), where you can choose between slice_min(), slice_head() etc, which will make your rule explicit.

Using this, you will realize that the solution given by @Quinten, with distinct(date, station_number, station, .keep_all = TRUE), corresponds to using slice_head(n=1), that is take the first row of duplicates. But you could also choose, to take the last slice_tail(n=1), or the one where longitude is highest (slice_max(longitude)), etc.

library(dplyr, warn.conflicts = FALSE)

df <- read.table(text = 'date     station_number    station   latitude   longitude
0101      11428            hansung      0         127.5
0101      11428            hansung      0         127.7
0101      11374           bookmuseum    0         127.3
0101      11380            mokryeon     14        127.9
0101      11380            mokryeon     14        128.1
0101      11388            healthcent   86        126.1', header = TRUE)

df %>%
  group_by(date, station_number, station) %>% 
  slice_min(longitude) %>% 
  ungroup()
#> # A tibble: 4 × 5
#>    date station_number station    latitude longitude
#>   <int>          <int> <chr>         <int>     <dbl>
#> 1   101          11374 bookmuseum        0      127.
#> 2   101          11380 mokryeon         14      128.
#> 3   101          11388 healthcent       86      126.
#> 4   101          11428 hansung           0      128.

^{Created on 2022-12-10 by the reprex package (v2.0.1)}