What I need to do:
I have a data frame and I am trying to find the mean of ride_duration in 3 cases: the overall mean, the mean for "Customer", and the mean for "Subscriber" in user_type.
When running head() this is what I see.
head(all_trips)
# A tibble: 6 × 8
trip_id start_time end_time bike_id from_station_id to_station_id user_type ride_duration
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 22081306 3/18/2019 22:36 3/18/2019 2… 4836 233 233 Customer 0:01:01
2 21965689 2/25/2019 19:44 2/25/2019 1… 5568 623 623 Customer 0:01:01
3 21763499 1/5/2019 7:49 1/5/2019 7:… 5905 45 45 Customer 0:01:01
4 21942654 2/20/2019 12:40 2/20/2019 1… 5940 3 3 Customer 0:01:01
5 21839414 1/18/2019 16:22 1/18/2019 1… 3925 424 424 Customer 0:01:03
6 22118075 3/23/2019 15:54 3/23/2019 1… 6025 34 34 Customer 0:01:04
When trying to convert ride_duration to numeric, it changes the entire column to NAs. The same thing happens when I try to convert start_time or end_time to numeric so I can then use difftime() to come up with a new ride_duration column.
What I have tried:
all_trips$ride_duration <- as.numeric(as.character(all_trips$ride_duration))
Warning message:
NAs introduced by coercion
> all_trips$ride_duration <- as.numeric(paste(all_trips$ride_duration))
Warning message:
NAs introduced by coercion
transform(all_trips, ride_duration= as.numeric(ride_duration))
The one above just returned a preview of the table with NAs all down ride_duration.
Then I tried to remove the : symbols and then convert to numeric.
all_trips$ride_duration <-gsub(":","",as.character(all_trips$ride_duration))
> head(all_trips)
# A tibble: 6 × 8
trip_id start_time end_time bike_id from_station_id to_station_id user_type ride_duration
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
1 22081306 3/18/2019 22:36 3/18/2019 2… 4836 233 233 Customer 00101
2 21965689 2/25/2019 19:44 2/25/2019 1… 5568 623 623 Customer 00101
3 21763499 1/5/2019 7:49 1/5/2019 7:… 5905 45 45 Customer 00101
4 21942654 2/20/2019 12:40 2/20/2019 1… 5940 3 3 Customer 00101
5 21839414 1/18/2019 16:22 1/18/2019 1… 3925 424 424 Customer 00103
6 22118075 3/23/2019 15:54 3/23/2019 1… 6025 34 34 Customer 00104
> all_trips$ride_duration <- as.numeric(as.character(all_trips$ride_duration))
Warning message:
NAs introduced by coercion
> head(all_trips)
# A tibble: 6 × 8
trip_id start_time end_time bike_id from_station_id to_station_id user_type ride_duration
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 22081306 3/18/2019 22:36 3/18/2019 2… 4836 233 233 Customer 101
2 21965689 2/25/2019 19:44 2/25/2019 1… 5568 623 623 Customer 101
3 21763499 1/5/2019 7:49 1/5/2019 7:… 5905 45 45 Customer 101
4 21942654 2/20/2019 12:40 2/20/2019 1… 5940 3 3 Customer 101
5 21839414 1/18/2019 16:22 1/18/2019 1… 3925 424 424 Customer 103
6 22118075 3/23/2019 15:54 3/23/2019 1… 6025 34 34 Customer 104
> mean(all_trips$ride_duration)
[1] NA
> mean(all_trips$ride_duration, na.rm = TRUE)
[1] 2989.162
>
I still got a warning telling me some items in there are NAs, and I am concerned because converting to numeric stripped the leading zeroes from my HH:MM:SS values, so I am worried the mean may be calculated incorrectly.
I have also tried converting my original 4 dataframes q1-4 into dates in a new column with:
q1$date <- as.Date(q1$start_time, "%m/%d/%Y %H:%M:%S")
but it returns values of NAs
To repeat the question: how can I convert ride_duration to numeric without creating NAs?
Also, is there a way to get back what I had before running as.numeric, without rerunning all of my code that loads and binds the data frames?
CodePudding user response:
Use as.ITime from data.table:
times <- c('0:01:01', '0:01:01', '0:01:01', '0:01:01', '0:01:03', '0:01:04')
as.numeric(data.table::as.ITime(times))
[1] 61 61 61 61 63 64
CodePudding user response:
Convert start and end times to use difftime()
To use difftime() on the end and start times, try converting them to POSIXct via something like
all_trips$start_time_posix <- as.POSIXct(all_trips$start_time, format = "%m/%d/%Y %H:%M")
as also seen in other discussions, e.g. here.
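A minimal sketch of the full difftime() route, assuming both columns use the month/day/year hour:minute format shown in the question. The toy values, the trips data frame and the ride_secs column are made up for illustration (the real end_time values are truncated in the printout), and tz = "UTC" is an assumption chosen only to sidestep daylight-saving gaps.

```r
# Toy data mirroring the start_time / end_time format from the question
trips <- data.frame(
  start_time = c("3/18/2019 22:36", "1/5/2019 7:49"),
  end_time   = c("3/18/2019 22:37", "1/5/2019 8:04"),
  stringsAsFactors = FALSE
)

# The format string must match the text exactly: there is no seconds field here
fmt   <- "%m/%d/%Y %H:%M"
start <- as.POSIXct(trips$start_time, format = fmt, tz = "UTC")
end   <- as.POSIXct(trips$end_time,   format = fmt, tz = "UTC")

# difftime() returns a difftime object; as.numeric() drops the units attribute
trips$ride_secs <- as.numeric(difftime(end, start, units = "secs"))
print(trips)
```

Because these timestamps carry no seconds, durations computed this way come out in whole minutes, so they will likely be coarser than the existing ride_duration column.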
Convert ride_duration directly
To convert the ride_duration to a numeric, we first have to know the units and then do something like this:
duration_secs <- sum(as.numeric(strsplit(ride_duration, ":")[[1]]) * c(3600, 60, 1))
This only works on a single value, but it shows the pattern. Assuming ride_duration is in HOURS:MINUTES:SECONDS and the desired unit is seconds, first split the string on ":", then multiply the three resulting numbers by 3600, 60 and 1 to bring them to seconds, and finally sum them up.
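The same split-and-sum pattern can be applied to a whole column in base R. This is only a sketch: hms_to_seconds is a hypothetical helper name, and it assumes every value has exactly three ":"-separated fields.

```r
# Hypothetical helper: convert "H:MM:SS" strings to seconds, element by element
hms_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  # Each element of parts is c(hours, minutes, seconds) as character
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

times <- c("0:01:01", "0:01:03", "1:00:00")
hms_to_seconds(times)
# [1]   61   63 3600
```

On the real data this would be used on the original character column, e.g. all_trips$ride_secs <- hms_to_seconds(all_trips$ride_duration), before any gsub() or as.numeric() step has altered it.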
Because I am too used to data.table, I would do this for the whole column as
library(data.table)
all_trips <- as.data.table(all_trips)
then add the columns of the individual units
all_trips[, c("duration_h", "duration_min", "duration_sec") :=
  tstrsplit(ride_duration, ":", fixed = TRUE, type.convert = TRUE)]
and then sum them up in a new column
all_trips[, duration_total :=
  duration_h * 3600 + duration_min * 60 + duration_sec]
Afterwards, you can delete the helper columns one at a time with
all_trips[, duration_h := NULL]
or all at once like
colsToDelete <- c("duration_h", "duration_min", "duration_sec")
all_trips[, (colsToDelete) := NULL]
There are likely more elegant ways that avoid adding a column for each time unit, but maybe this is a sufficient and intelligible example to get you running.
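Once a numeric duration column exists, the three means asked for in the question could be computed roughly like this. The toy trips data frame and its values are invented for illustration, and duration_total is assumed to hold seconds.

```r
# Toy data: user_type plus a numeric duration in seconds
trips <- data.frame(
  user_type      = c("Customer", "Customer", "Subscriber", "Subscriber"),
  duration_total = c(61, 63, 300, 500)
)

# Overall mean; na.rm = TRUE skips any rows that failed to parse
overall_mean <- mean(trips$duration_total, na.rm = TRUE)

# One mean per user_type, returned as a named vector
means_by_type <- tapply(trips$duration_total, trips$user_type, mean, na.rm = TRUE)

overall_mean
# [1] 231
means_by_type
#   Customer Subscriber
#         62        400
```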