In my dataset, I have observations on different people, for each month, in different places. I want to create a new column which is: The average of value
for last month for all places the person visited other than the place of the current row. So for example, row 6 would be the average of rows 2-5 (last month in places that are not A
).
In my actual dataset, the data is not balanced -- and users are observed for different months -- so any solution should not necessarily rely on that.
library(tidyr)
# Generate people, places, and dates
people <- letters[1:3]
places <- LETTERS[1:5]
months <- seq(as.Date("2020/01/04"), by = "month", length.out = 3)
# Now cross data so there is observation for each person, for each place, for each month
crossed <- crossing(people, dates, places)
# Add random values
crossed$value <- rnorm(nrow(crossed), 2)
Here is what the data looks like
people dates places value
<chr> <date> <chr> <dbl>
1 a 2020-01-04 A 2.94
2 a 2020-01-04 B 1.74
3 a 2020-01-04 C 2.68
4 a 2020-01-04 D 3.96
5 a 2020-01-04 E 0.821
6 a 2020-02-04 A 1.86
7 a 2020-02-04 B 1.04
8 a 2020-02-04 C 1.51
9 a 2020-02-04 D 3.62
10 a 2020-02-04 E 1.81
CodePudding user response:
This works if the dates for each month are always exactly one month apart. If not, you’ll have to tinker with dates == .y - months(1)
, but the general approach will be the same.
library(tidyverse)
set.seed(1)
crossed %>%
group_by(people) %>%
mutate(
new_val = map2_dbl(
places,
dates,
~ mean(value[places != .x & dates == .y - months(1)]))
) %>%
ungroup()
Output:
# A tibble: 45 x 5
people dates places value new_val
<chr> <date> <chr> <dbl> <dbl>
1 a 2020-01-04 A 1.37 NaN
2 a 2020-01-04 B 2.18 NaN
3 a 2020-01-04 C 1.16 NaN
4 a 2020-01-04 D 3.60 NaN
5 a 2020-01-04 E 2.33 NaN
6 a 2020-02-04 A 1.18 2.32
7 a 2020-02-04 B 2.49 2.12
8 a 2020-02-04 C 2.74 2.37
9 a 2020-02-04 D 2.58 1.76
10 a 2020-02-04 E 1.69 2.08
# ... with 35 more rows
CodePudding user response:
If I understood you correctly this solution should work.
library(tidyverse)
crossed <- crossing(people, months, places)
# use a helper variable which is always the first day of the month
crossed2 <- crossed %>%
dplyr::mutate(hmonths = lubridate::floor_date(months, unit = "months"))
crossed2 %>%
# generate all combination by people
dplyr::full_join(crossed2, by = "people", suffix = c("","_b")) %>%
# reduce to the combinations you are looking for
dplyr::filter(hmonths > hmonths_b & difftime(hmonths_b, hmonths, "days") < 32 & places != places_b) %>%
# build groupings
dplyr::group_by(people, months, places, value) %>%
# calculate the mean (possibly use na.rm = TRUE)
dplyr::summarise(op_mean = mean(value_b)) %>%
# always ungroup to prevent unwanted behaviour down stream
dplyr::ungroup() %>%
# as some data will not match the filter we use an inverted left join from the original data
dplyr::right_join(crossed, by = c("people", "months", "places", "value")) %>%
dplyr::arrange(people, months, places)
# A tibble: 45 x 5
people months places value op_mean
<chr> <date> <chr> <dbl> <dbl>
1 a 2020-01-04 A 1.24 NA
2 a 2020-01-04 B 2.47 NA
3 a 2020-01-04 C 0.999 NA
4 a 2020-01-04 D 1.67 NA
5 a 2020-01-04 E 2.96 NA
6 a 2020-02-04 A 2.51 2.03
7 a 2020-02-04 B 2.35 1.72
8 a 2020-02-04 C 1.47 2.09
9 a 2020-02-04 D 3.35 1.92
10 a 2020-02-04 E 1.12 1.59
# ... with 35 more rows