Duplicated function in R outputting a non-duplicate entry

Time:09-28

I have a data frame called "all_trip" which contains duplicate and non-duplicate entries. I use the duplicated() function on it, and it works, except that it outputs one entry that shouldn't be there.

> print_rows <- all_trip[1365:1370,]
> print_rows

# A tibble: 6 x 11
  ride_id          rideable_type started_at          ended_at            start_station_n~ end_station_name
  <chr>            <chr>         <dttm>              <dttm>              <chr>            <chr>           
1 E1BC31FBB70B9296 classic_bike  2021-06-18 17:20:09 2021-06-18 17:27:13 Broadway & Wils~ Sheridan Rd & I~
2 69ACAA3687B7747E classic_bike  2021-06-18 17:20:09 2021-06-18 17:27:13 Broadway & Wils~ Sheridan Rd & I~
3 C24A66453F0C81DC docked_bike   2021-06-18 18:02:23 2021-06-18 18:15:14 Michigan Ave & ~ Michigan Ave & ~
4 D81BC7FFE8502818 electric_bike 2021-06-18 18:02:23 2021-06-18 18:15:14 Wolcott Ave & P~ Halsted St & Ar~
5 B67FA4D8DCC9BBFE classic_bike  2021-06-18 18:16:37 2021-06-18 18:40:39 Damen Ave & Pie~ Kedzie Ave & Pa~
6 4ECDC385B60B8D25 classic_bike  2021-06-18 18:16:37 2021-06-18 18:40:39 Lakefront Trail~ Glenwood Ave & ~
# ... with 5 more variables: start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
#   member_casual <chr>

> same_times_and_place <- all_trip[duplicated(all_trip[c("rideable_type", "start_station_name", "end_station_name", "start_lat", "start_lng", "end_lat", "end_lng", "member_casual")]), ]
> print_duplicate_rows <- same_times_and_place[291:293,]
> print_duplicate_rows

# A tibble: 3 x 11
  ride_id          rideable_type started_at          ended_at            start_station_n~ end_station_name
  <chr>            <chr>         <dttm>              <dttm>              <chr>            <chr>           
1 16921B172C161F30 docked_bike   2021-06-18 15:53:18 2021-06-18 18:03:09 LaSalle St & Il~ Sheffield Ave &~
2 C24A66453F0C81DC docked_bike   2021-06-18 18:02:23 2021-06-18 18:15:14 Michigan Ave & ~ Michigan Ave & ~
3 15DE43740FDDCCC1 classic_bike  2021-06-18 18:28:38 2021-06-18 18:37:07 Wells St & Ever~ Rush St & Cedar~
# ... with 5 more variables: start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
#   member_casual <chr>

The Michigan and Wolcott rows are different, yet the duplicated() function seems to treat them as the same. What's stranger is that, if the function really did consider them identical, it output the Michigan row rather than the Wolcott row.

Here is the dput() output for all_trip[1367:1372,]:

trip_data <-
  structure(
    list(
      ride_id = c(
        "C24A66453F0C81DC",
        "D81BC7FFE8502818",
        "B67FA4D8DCC9BBFE",
        "4ECDC385B60B8D25",
        "A9ECC363F64D0767",
        "15DE43740FDDCCC1"
      ),
      rideable_type = c(
        "docked_bike",
        "electric_bike",
        "classic_bike",
        "classic_bike",
        "classic_bike",
        "classic_bike"
      ),
      started_at = structure(
        c(
          1624039343,
          1624039343,
          1624040197,
          1624040197,
          1624040918,
          1624040918
        ),
        tzone = "UTC",
        class = c("POSIXct",
                  "POSIXt")
      ),
      ended_at = structure(
        c(
          1624040114,
          1624040114,
          1624041639,
          1624041639,
          1624041427,
          1624041427
        ),
        tzone = "UTC",
        class = c("POSIXct",
                  "POSIXt")
      ),
      start_station_name = c(
        "Michigan Ave & Oak St",
        "Wolcott Ave & Polk St",
        "Damen Ave & Pierce Ave",
        "Lakefront Trail & Wilson Ave",
        "Wells St & Evergreen Ave",
        "Wells St & Evergreen Ave"
      ),
      end_station_name = c(
        "Michigan Ave & Oak St",
        "Halsted St & Archer Ave",
        "Kedzie Ave & Palmer Ct",
        "Glenwood Ave & Morse Ave",
        "Rush St & Cedar St",
        "Rush St & Cedar St"
      ),
      start_lat = c(
        41.90096,
        41.8712378333333,
        41.9093960065,
        41.965845,
        41.906724,
        41.906724
      ),
      start_lng = c(
        -87.623777,
        -87.6736628333333,
        -87.6776919292,
        -87.645361,
        -87.63483,
        -87.63483
      ),
      end_lat = c(
        41.90096,
        41.8472958333333,
        41.921525,
        42.00797192287,
        41.90230870122,
        41.90230870122
      ),
      end_lng = c(
        -87.623777,
        -87.646736,
        -87.707322,
        -87.6655023944,
        -87.627690528,
        -87.627690528
      ),
      member_casual = c("casual", "member", "casual", "member",
                        "member", "member")
    ),
    row.names = c(NA,-6L),
    class = c("tbl_df",
              "tbl", "data.frame")
  )

CodePudding user response:

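The column list you pass to duplicated() leaves out started_at and ended_at, so a row is flagged whenever its bike type, stations, coordinates, and rider type match any earlier row in the data, regardless of time. The Michigan row was most likely matched against an earlier trip somewhere above it, not against the Wolcott row. To compare on everything except the ID, do it this way (the pipe and select() come from dplyr):

library(dplyr)

# Flag rows that repeat an earlier row in every column except ride_id
trip_data[trip_data %>% select(-ride_id) %>% duplicated(), ]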

Output:

# A tibble: 1 x 11
  ride_id          rideable_type started_at          ended_at            start_station_name       end_station_name   start_lat start_lng end_lat end_lng member_casual
  <chr>            <chr>         <dttm>              <dttm>              <chr>                    <chr>                  <dbl>     <dbl>   <dbl>   <dbl> <chr>        
1 15DE43740FDDCCC1 classic_bike  2021-06-18 18:28:38 2021-06-18 18:37:07 Wells St & Evergreen Ave Rush St & Cedar St      41.9     -87.6    41.9   -87.6 member

This seems to be the correct result.
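
To see why a row can be flagged even though the row right next to it looks different, here is a minimal base R sketch. The `trips` data frame and its values are made up for illustration and are not taken from all_trip. It shows that duplicated() compares each row against all earlier rows on the selected columns only, marks just the later occurrence, and that fromLast = TRUE can be combined with it to get every member of a duplicate group.

```r
# Toy data (hypothetical, not taken from all_trip): rows 1 and 3 share a
# station but differ in start time; rows 3 and 4 match on both columns.
trips <- data.frame(
  ride_id    = c("A", "B", "C", "D"),
  station    = c("Michigan", "Wolcott", "Michigan", "Michigan"),
  started_at = c(100, 200, 300, 300),
  stringsAsFactors = FALSE
)

# Comparing on station only: every repeat of an earlier station is flagged,
# even when the rows are not adjacent and the times differ.
by_station <- duplicated(trips["station"])
print(by_station)

# Comparing on station and time: only row 4 repeats an earlier row.
by_station_time <- duplicated(trips[c("station", "started_at")])
print(by_station_time)

# fromLast = TRUE scans from the bottom, so combining both directions
# returns every member of a duplicate group, including the first one.
key <- trips[c("station", "started_at")]
all_members <- duplicated(key) | duplicated(key, fromLast = TRUE)
print(trips[all_members, ])
```

Applied to the question, the same both-directions trick on the original column list should reveal which earlier row the Michigan trip was matched against.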
