Home > database >  Calculation for last month -- with conditions
Calculation for last month -- with conditions

Time:03-30

In my dataset, I have observations on different people, for each month, in different places. I want to create a new column which is: The average of value for last month for all places the person visited other than the place of the current row. So for example, row 6 would be the average of rows 2-5 (last month in places that are not A).

In my actual dataset, the data is not balanced -- and users are observed for different months -- so any solution should not necessarily rely on that.

library(tidyr)

# Generate people, places, and dates
people <- letters[1:3]
places <- LETTERS[1:5]
months <- seq(as.Date("2020/01/04"), by = "month", length.out = 3)

# Now cross data so there is observation for each person, for each place, for each month
crossed <- crossing(people, dates, places)

# Add random values
crossed$value <- rnorm(nrow(crossed), 2)

Here is what the data looks like

   people dates      places value
   <chr>  <date>     <chr>  <dbl>
 1 a      2020-01-04 A      2.94 
 2 a      2020-01-04 B      1.74 
 3 a      2020-01-04 C      2.68 
 4 a      2020-01-04 D      3.96 
 5 a      2020-01-04 E      0.821
 6 a      2020-02-04 A      1.86 
 7 a      2020-02-04 B      1.04 
 8 a      2020-02-04 C      1.51 
 9 a      2020-02-04 D      3.62 
10 a      2020-02-04 E      1.81 

CodePudding user response:

This works if the dates for each month are always exactly one month apart. If not, you’ll have to tinker with dates == .y - months(1), but the general approach will be the same.

library(tidyverse)
set.seed(1)

crossed %>%
  group_by(people) %>%
  mutate(
    new_val = map2_dbl(
    places, 
    dates,
    ~ mean(value[places != .x & dates == .y - months(1)]))
  ) %>%
  ungroup()

Output:

# A tibble: 45 x 5
   people dates      places value new_val
   <chr>  <date>     <chr>  <dbl>   <dbl>
 1 a      2020-01-04 A       1.37  NaN   
 2 a      2020-01-04 B       2.18  NaN   
 3 a      2020-01-04 C       1.16  NaN   
 4 a      2020-01-04 D       3.60  NaN   
 5 a      2020-01-04 E       2.33  NaN   
 6 a      2020-02-04 A       1.18    2.32
 7 a      2020-02-04 B       2.49    2.12
 8 a      2020-02-04 C       2.74    2.37
 9 a      2020-02-04 D       2.58    1.76
10 a      2020-02-04 E       1.69    2.08
# ... with 35 more rows

CodePudding user response:

If I understood you correctly this solution should work.

library(tidyverse)

crossed <- crossing(people, months, places)

# use a helper variable which is always the first day of the month
crossed2 <- crossed %>% 
    dplyr::mutate(hmonths = lubridate::floor_date(months, unit = "months")) 

crossed2 %>%
    # generate all combination by people
    dplyr::full_join(crossed2, by = "people", suffix = c("","_b")) %>% 
    # reduce to the combinations you are looking for
    dplyr::filter(hmonths > hmonths_b & difftime(hmonths_b, hmonths, "days") < 32  & places != places_b) %>% 
    # build groupings
    dplyr::group_by(people, months, places, value) %>% 
    # calculate the mean (possibly use na.rm = TRUE)
    dplyr::summarise(op_mean = mean(value_b)) %>% 
    # always ungroup to prevent unwanted behaviour down stream
    dplyr::ungroup() %>%
    # as some data will not match the filter we use an inverted left join from the original data
    dplyr::right_join(crossed, by = c("people", "months", "places", "value")) %>%
    dplyr::arrange(people, months, places)

# A tibble: 45 x 5
   people months     places value op_mean
   <chr>  <date>     <chr>  <dbl>   <dbl>
 1 a      2020-01-04 A      1.24    NA   
 2 a      2020-01-04 B      2.47    NA   
 3 a      2020-01-04 C      0.999   NA   
 4 a      2020-01-04 D      1.67    NA   
 5 a      2020-01-04 E      2.96    NA   
 6 a      2020-02-04 A      2.51     2.03
 7 a      2020-02-04 B      2.35     1.72
 8 a      2020-02-04 C      1.47     2.09
 9 a      2020-02-04 D      3.35     1.92
10 a      2020-02-04 E      1.12     1.59
# ... with 35 more rows
  • Related