Home > Mobile >  Fill up rows in a dataframe in R
Fill up rows in a dataframe in R

Time:11-26

I have a dataframe with information of purchases made by clients from May (202105) to October 2021 (202110). The clients don't buy every month, but I want to fill up that information with zeroes. My data looks like this:

soex <- data.frame(client_id = c("aaa","bbb","bbb","ccc","ccc","ddd","eee","eee","eee"),
               v1 = c("xxx","xxx","xxx","yyy","yyy","xxx","yyy","xxx","yyy"),
               first_buy = c("202105","202107","202107","202106","202106","202110","202107","202107","202107"),
               sales_date = c("202105","202107","202109","202106","202110","202110","202107","202108","202109"),
               qt_prod1 = c(10,60,30,2,45,11,14,167,145),
               qt_prod2 = c(12,324,433,221,312,312,312,123,121))
  client_id  v1 first_buy sales_date qt_prod1 qt_prod2
1       aaa xxx    202105     202105       10       12
2       bbb xxx    202107     202107       60      324
3       bbb xxx    202107     202109       30      433
4       ccc yyy    202106     202106        2      221
5       ccc yyy    202106     202110       45      312
6       ddd xxx    202110     202110       11      312
7       eee yyy    202107     202107       14      312
8       eee xxx    202107     202108      167      123
9       eee yyy    202107     202109      145      121
  • client_id = client id
  • v1 = random variable
  • first_buy = year and month of the firs purchase
  • sales_date = year and month of the purchase. The first one is always the same as the first_buy
  • qi_prod1 (and 2) = amount of product bought

What I need is a dataframe that looks like this:

ideal <- data.frame(client_id = c("aaa","aaa","aaa","aaa","aaa","aaa","bbb","bbb","bbb","bbb","ccc","ccc","ccc","ccc","ccc","ddd","eee","eee","eee","eee"),
                    v1 = c("xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","xxx","yyy","yyy","yyy","yyy","yyy","xxx","yyy","xxx","yyy","yyy"),
                    first_buy = c("202105","202105","202105","202105","202105","202105","202107","202107","202107","202107","202106","202106","202106","202106","202106","202110","202107","202107","202107","202107"),
                    sales_date = c("202105","202106","202107","202108","202109","202110","202107","202108","202109","202110","202106","202107","202108","202109","202110","202110","202107","202108","202109","202110"),
                    qt_prod1 = c(10,0,0,0,0,0,60,0,0,30,2,0,0,0,45,11,14,167,145,0),
                    qt_prod2 = c(12,0,0,0,0,0,324,0,0,433,221,0,0,0,312,312,312,123,121,0))
   client_id  v1 first_buy sales_date qt_prod1 qt_prod2
1        aaa xxx    202105     202105       10       12
2        aaa xxx    202105     202106        0        0
3        aaa xxx    202105     202107        0        0
4        aaa xxx    202105     202108        0        0
5        aaa xxx    202105     202109        0        0
6        aaa xxx    202105     202110        0        0
7        bbb xxx    202107     202107       60      324
8        bbb xxx    202107     202108        0        0
9        bbb xxx    202107     202109        0        0
10       bbb xxx    202107     202110       30      433
11       ccc yyy    202106     202106        2      221
12       ccc yyy    202106     202107        0        0
13       ccc yyy    202106     202108        0        0
14       ccc yyy    202106     202109        0        0
15       ccc yyy    202106     202110       45      312
16       ddd xxx    202110     202110       11      312
17       eee yyy    202107     202107       14      312
18       eee xxx    202107     202108      167      123
19       eee yyy    202107     202109      145      121
20       eee yyy    202107     202110        0        0

My problem comes specially when I have to consider the first_buy variable. As you can see in the case of the "bbb" client, I don't want data to start on May 2021... I want it to start on the first_buy month until 202110.

My other problem is to fill up the information of the V1 variable with the information of the sales_date. If you see, for example, the case of the "eee" client, she didn't buy in 202110, but in V1 she have the information of 202109

Thanks,

CodePudding user response:

A tidyverse option -

  • Change sales_date to date object
  • For each client_id create a monthly sequence of dates
  • fill first_buy and v1 columns.
library(dplyr)
library(tidyr)
library(lubridate)

soex %>%
  mutate(sales_date = ymd(paste0(sales_date, '01'))) %>%
  group_by(client_id) %>%
  complete(sales_date = seq(min(sales_date), ymd('20211001'), by = 'month'), 
           fill = list(qt_prod1 = 0, qt_prod2 = 0)) %>%
  mutate(sales_date = format(sales_date, '%Y%m')) %>%
  fill(first_buy, v1) %>%
  ungroup()

#  client_id sales_date v1    first_buy qt_prod1 qt_prod2
#   <chr>     <chr>      <chr> <chr>        <dbl>    <dbl>
# 1 aaa       202105     xxx   202105          10       12
# 2 aaa       202106     xxx   202105           0        0
# 3 aaa       202107     xxx   202105           0        0
# 4 aaa       202108     xxx   202105           0        0
# 5 aaa       202109     xxx   202105           0        0
# 6 aaa       202110     xxx   202105           0        0
# 7 bbb       202107     xxx   202107          60      324
# 8 bbb       202108     xxx   202107           0        0
# 9 bbb       202109     xxx   202107          30      433
#10 bbb       202110     xxx   202107           0        0
#11 ccc       202106     yyy   202106           2      221
#12 ccc       202107     yyy   202106           0        0
#13 ccc       202108     yyy   202106           0        0
#14 ccc       202109     yyy   202106           0        0
#15 ccc       202110     yyy   202106          45      312
#16 ddd       202110     xxx   202110          11      312
#17 eee       202107     yyy   202107          14      312
#18 eee       202108     xxx   202107         167      123
#19 eee       202109     yyy   202107         145      121
#20 eee       202110     yyy   202107           0        0
  • Related