I have people's trip records as a data frame in R. Which looks as below:
t_participant_id t_destination_PostCode tripReasonString tripSequence
1 304 1082 work 1
2 304 1000 work 2
3 304 1000 work 3
4 304 1000 other 4
5 304 1000 other 5
6 304 1082 other 6
7 304 1082 home 7
8 362 1070 shopping 1
9 362 1070 work 2
10 362 1070 work 3
11 362 1070 shopping 4
12 362 1180 home 5
13 362 1070 leisure 6
14 362 1180 home 7
15 482 2800 work 1
16 482 2800 work 2
17 482 1020 home 3
Basically, t_participant_id
is a person's unique ID, tripReason
is the reason for people's trip, and the t_destination_postcode
is the area that they are going to (for instance, in row 1, person 304
go to area 1082
for work
).
Different people have different numbers of trips during one day (person 304
and 362
has 7 trips in total, whereas person 482
has 3). I would like to filter out the person that the t_destination_PostCode
for the trip purpose work
is the same as other work trips under this person's activity chain.
In short, for person 304
, his first work trip is at line 1, and going to the area 1082
. The next work trip is at line 2, and going to the area 1000
. So I know he is not the person I would like to filter out. However, for person 362
, he has the first work
activity at tripSequence == 2
, heading to 1070
. The next work
trip is at tripSequence
3, and also to 1070. No other work trips in the activity chain anymore. So in this case, we would like to filter out person 362
, and record the first work tripSequence
, which is 2.
So the final result should be a data frame that says
t_participant_id firstWorkTrip
362 2
482 1
I really don't want to do this manually but not sure how can I achieve this in R. Thanks very much for your help in advance!
Here is the reprex for the dataframe
test <- data.frame(list(t_participant_id = c(304L, 304L, 304L, 304L, 304L,
304L, 304L, 362L, 362L, 362L, 362L, 362L, 362L, 362L, 482L, 482L,
482L), t_destination_PostCode = c(1082L, 1000L, 1000L, 1000L,
1000L, 1082L, 1082L, 1070L, 1070L, 1070L, 1070L, 1180L, 1070L,
1180L, 2800L, 2800L, 1020L), tripReasonString = c("work", "work",
"work", "other", "other", "other", "home", "shopping", "work",
"work", "shopping", "home", "leisure", "home", "work", "work",
"home"), tripSequence = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L)))
CodePudding user response:
df %>%
filter(tripReasonString == "work") %>%
group_by(t_participant_id) %>%
filter(n_distinct(t_destination_PostCode) == 1) %>%
summarize(first_work_trip = min(tripSequence))
# # A tibble: 2 × 2
# t_participant_id first_work_trip
# <int> <int>
# 1 362 2
# 2 482 1