I have a data frame called "all_trip" that contains both duplicate and non-duplicate entries. I use the duplicated()
function on it, and it works, except that it returns one entry that shouldn't be there.
> print_rows <- all_trip[1365:1370,]
> print_rows
# A tibble: 6 x 11
ride_id rideable_type started_at ended_at start_station_n~ end_station_name
<chr> <chr> <dttm> <dttm> <chr> <chr>
1 E1BC31FBB70B9296 classic_bike 2021-06-18 17:20:09 2021-06-18 17:27:13 Broadway & Wils~ Sheridan Rd & I~
2 69ACAA3687B7747E classic_bike 2021-06-18 17:20:09 2021-06-18 17:27:13 Broadway & Wils~ Sheridan Rd & I~
3 C24A66453F0C81DC docked_bike 2021-06-18 18:02:23 2021-06-18 18:15:14 Michigan Ave & ~ Michigan Ave & ~
4 D81BC7FFE8502818 electric_bike 2021-06-18 18:02:23 2021-06-18 18:15:14 Wolcott Ave & P~ Halsted St & Ar~
5 B67FA4D8DCC9BBFE classic_bike 2021-06-18 18:16:37 2021-06-18 18:40:39 Damen Ave & Pie~ Kedzie Ave & Pa~
6 4ECDC385B60B8D25 classic_bike 2021-06-18 18:16:37 2021-06-18 18:40:39 Lakefront Trail~ Glenwood Ave & ~
# ... with 5 more variables: start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
# member_casual <chr>
> same_times_and_place <- all_trip[duplicated(all_trip[c("rideable_type", "start_station_name", "end_station_name", "start_lat", "start_lng", "end_lat", "end_lng", "member_casual")]), ]
> print_duplicate_rows <- same_times_and_place[291:293,]
# A tibble: 3 x 11
ride_id rideable_type started_at ended_at start_station_n~ end_station_name
<chr> <chr> <dttm> <dttm> <chr> <chr>
1 16921B172C161F30 docked_bike 2021-06-18 15:53:18 2021-06-18 18:03:09 LaSalle St & Il~ Sheffield Ave &~
2 C24A66453F0C81DC docked_bike 2021-06-18 18:02:23 2021-06-18 18:15:14 Michigan Ave & ~ Michigan Ave & ~
3 15DE43740FDDCCC1 classic_bike 2021-06-18 18:28:38 2021-06-18 18:37:07 Wells St & Ever~ Rush St & Cedar~
# ... with 5 more variables: start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
# member_casual <chr>
Michigan and Wolcott are different stations, yet duplicated() seems to treat them as the same. What's stranger is that, if the function did think they were the same, it returned Michigan instead of Wolcott.
I used dput() on all_trip[1367:1372,]:
trip_data <-
structure(
list(
ride_id = c(
"C24A66453F0C81DC",
"D81BC7FFE8502818",
"B67FA4D8DCC9BBFE",
"4ECDC385B60B8D25",
"A9ECC363F64D0767",
"15DE43740FDDCCC1"
),
rideable_type = c(
"docked_bike",
"electric_bike",
"classic_bike",
"classic_bike",
"classic_bike",
"classic_bike"
),
started_at = structure(
c(
1624039343,
1624039343,
1624040197,
1624040197,
1624040918,
1624040918
),
tzone = "UTC",
class = c("POSIXct",
"POSIXt")
),
ended_at = structure(
c(
1624040114,
1624040114,
1624041639,
1624041639,
1624041427,
1624041427
),
tzone = "UTC",
class = c("POSIXct",
"POSIXt")
),
start_station_name = c(
"Michigan Ave & Oak St",
"Wolcott Ave & Polk St",
"Damen Ave & Pierce Ave",
"Lakefront Trail & Wilson Ave",
"Wells St & Evergreen Ave",
"Wells St & Evergreen Ave"
),
end_station_name = c(
"Michigan Ave & Oak St",
"Halsted St & Archer Ave",
"Kedzie Ave & Palmer Ct",
"Glenwood Ave & Morse Ave",
"Rush St & Cedar St",
"Rush St & Cedar St"
),
start_lat = c(
41.90096,
41.8712378333333,
41.9093960065,
41.965845,
41.906724,
41.906724
),
start_lng = c(
-87.623777,
-87.6736628333333,
-87.6776919292,
-87.645361,
-87.63483,
-87.63483
),
end_lat = c(
41.90096,
41.8472958333333,
41.921525,
42.00797192287,
41.90230870122,
41.90230870122
),
end_lng = c(
-87.623777,
-87.646736,
-87.707322,
-87.6655023944,
-87.627690528,
-87.627690528
),
member_casual = c("casual", "member", "casual", "member",
"member", "member")
),
row.names = c(NA,-6L),
class = c("tbl_df",
"tbl", "data.frame")
)
CodePudding user response:
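Leave out only ride_id and check for duplicates across all the remaining columns, including started_at and ended_at. Your original call left those two time columns out of the comparison. This needs dplyr for %>% and select():
library(dplyr)
trip_data[trip_data %>% select(-ride_id) %>% duplicated(),]
Output: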
# A tibble: 1 x 11
ride_id rideable_type started_at ended_at start_station_name end_station_name start_lat start_lng end_lat end_lng member_casual
<chr> <chr> <dttm> <dttm> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 15DE43740FDDCCC1 classic_bike 2021-06-18 18:28:38 2021-06-18 18:37:07 Wells St & Evergreen Ave Rush St & Cedar St 41.9 -87.6 41.9 -87.6 member
This seems to be the correct result: row 6 matches row 5 in every column except ride_id, and the Michigan row is no longer flagged.
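As for why the original call flagged the Michigan row: duplicated() marks the second and later occurrences of a combination, and it only compares the columns it is given. The Michigan row was presumably matched against some earlier docked_bike round trip from the same station, not against the Wolcott row next to it, although the rows before 1367 are not shown, so this is an assumption. Here is a minimal base R sketch of that behavior, using a made-up data frame (the `trips` object and its values are hypothetical, not taken from all_trip):

```r
# Hypothetical toy data, not taken from all_trip.
trips <- data.frame(
  ride_id    = c("A1", "B2", "C3", "D4"),
  station    = c("Michigan", "Wolcott", "Michigan", "Michigan"),
  started_at = as.POSIXct(
    c("2021-06-18 09:00:00", "2021-06-18 18:02:23",
      "2021-06-18 18:02:23", "2021-06-18 18:02:23"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Comparing on station only: rows 3 and 4 repeat row 1's station,
# so both are flagged even though row 1 started at a different time.
dup_station <- duplicated(trips["station"])

# Comparing on station and start time: only row 4 repeats an earlier row (row 3).
dup_station_time <- duplicated(trips[c("station", "started_at")])

# fromLast = TRUE scans from the bottom up; combining both directions
# marks every member of a duplicate group, not just the later ones.
dup_all_members <- dup_station_time |
  duplicated(trips[c("station", "started_at")], fromLast = TRUE)

print(trips[dup_station, ])
print(trips[dup_station_time, ])
print(trips[dup_all_members, ])
```

If you want to see both the flagged row and the earlier row it matched, the fromLast combination in the last step is one way to do it.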