Add number of months in a dataframe in R

I have a dataframe like this:

customer= c('1530','1530','1530','1531','1531','1532')  
month =  c('2021-10-01','2021-11-01','2021-12-01','2021-11-01','2021-12-01','2021-12-01')  
month_number = c(1,2,3,1,2,1)  
df <- data.frame('customer_id'=customer, entry_month=month)  
df

|   | customer_id | entry_month |
| - | ----------- | ----------- |
| 1 | 1530        | 2021-10-01  |
| 2 | 1530        | 2021-11-01  |
| 3 | 1530        | 2021-12-01  |
| 4 | 1531        | 2021-11-01  |
| 5 | 1531        | 2021-12-01  |
| 6 | 1532        | 2021-12-01  |

I need to create a column that indicates the number of the month since the customer joined. Here is my desired output:

new_df <- data.frame('customer_id'=customer, 'entry_month'=month, 'month_number'=month_number)  
new_df

|   | customer_id | entry_month | month_number |
| - | ----------- | ----------- | ------------ |
| 1 | 1530        | 2021-10-01  | 1            |
| 2 | 1530        | 2021-11-01  | 2            |
| 3 | 1530        | 2021-12-01  | 3            |
| 4 | 1531        | 2021-11-01  | 1            |
| 5 | 1531        | 2021-12-01  | 2            |
| 6 | 1532        | 2021-12-01  | 1            |

CodePudding user response:

You can convert entry_month to the Date class and then, within each customer, take the difference from first(entry_month):

library(dplyr)
df %>%
  group_by(customer_id) %>%
  mutate(
    entry_month = as.Date(entry_month),
    nmonth = round(as.numeric(entry_month - first(entry_month)) / 30) + 1
  )

# A tibble: 6 x 3
# Groups:   customer_id [3]
  customer_id entry_month nmonth
  <chr>       <date>       <dbl>
1 1530        2021-10-01       1
2 1530        2021-11-01       2
3 1530        2021-12-01       3
4 1531        2021-11-01       1
5 1531        2021-12-01       2
6 1532        2021-12-01       1

Note that this works only if entry_month is always the first day of a month, because it approximates one month as 30 days (and that approximation can drift over spans of several years). Otherwise you will have to specify exactly what one month means. E.g. if the first entry is on 2021-10-20 and the second one is on 2021-11-10, what would the desired value of nmonth be?
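One possible answer to that question is to ignore the day and count calendar months. The following is a minimal base R sketch under that assumption, not the method from the answer above; `month_index` is a hypothetical helper name and the sample data (including the skipped month for customer 1530) is made up for illustration.

```r
# Hypothetical helper: turn a date into a running calendar-month index,
# ignoring the day of the month.
month_index <- function(d) {
  lt <- as.POSIXlt(as.Date(d))
  (lt$year + 1900) * 12 + lt$mon   # lt$mon is 0-based (0 = January)
}

# Made-up data: entries not on the first of the month, with a skipped month
df <- data.frame(
  customer_id = c("1530", "1530", "1530", "1531"),
  entry_month = c("2021-10-20", "2021-11-10", "2022-01-05", "2021-11-01")
)

idx <- month_index(df$entry_month)

# ave() computes min(idx) within each customer_id and returns it row by row
df$nmonth <- idx - ave(idx, df$customer_id, FUN = min) + 1

df$nmonth
# 1 2 4 1
```

Under this definition 2021-10-20 and 2021-11-10 are months 1 and 2, and a skipped month leaves a gap in the numbering (December 2021 would have been month 3).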

CodePudding user response:

This takes the year-month part of the date and counts distinct values.

I extended the example to include a repeated month.

library(dplyr)

df %>% 
  group_by(customer_id) %>% 
  arrange(entry_month, .by_group = TRUE) %>% 
  mutate(month_number = cumsum(
           !duplicated(strftime(entry_month, "%Y-%m")))) %>% 
  ungroup()
# A tibble: 7 × 3
  customer_id entry_month month_number
  <chr>       <chr>              <int>
1 1530        2021-10-01             1
2 1530        2021-10-12             1
3 1530        2021-11-01             2
4 1530        2021-12-01             3
5 1531        2021-11-01             1
6 1531        2021-12-01             2
7 1532        2021-12-01             1

Data

df <- structure(list(customer_id = c("1530", "1530", "1530", "1530",
"1531", "1531", "1532"), entry_month = c("2021-10-01", "2021-10-12",
"2021-11-01", "2021-12-01", "2021-11-01", "2021-12-01", "2021-12-01"
)), row.names = c(NA, -7L), class = "data.frame")
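To see what the cumsum(!duplicated(...)) step does, here it is on a plain character vector of year-month strings. This is only an illustrative sketch; the vectors `ym` and `ym_gap` are made-up examples.

```r
# Year-month strings for one customer, already sorted by date
ym <- c("2021-10", "2021-10", "2021-11", "2021-12")

# TRUE the first time a year-month appears, FALSE for repeats
first_seen <- !duplicated(ym)
first_seen
# TRUE FALSE TRUE TRUE

# Running count of distinct year-months seen so far
month_number <- cumsum(first_seen)
month_number
# 1 1 2 3

# A skipped calendar month is not counted: the numbering stays consecutive
ym_gap <- c("2021-10", "2021-12")
gap_number <- cumsum(!duplicated(ym_gap))
gap_number
# 1 2
```

So this approach numbers the months in which a customer has entries, not the calendar months elapsed since joining. The two agree as long as no month is skipped.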

CodePudding user response:

Alternatively, you can use the data.table package:

library(data.table)

dt <- setDT(df)

dt[, entry_month := as.IDate(entry_month)] # Transform the column to "IDate"

dt[, month_number := seq_len(.N), by = customer_id] # Number the rows within each customer

dt

Output:

   customer_id entry_month month_number
1:        1530  2021-10-01            1
2:        1530  2021-11-01            2
3:        1530  2021-12-01            3
4:        1531  2021-11-01            1
5:        1531  2021-12-01            2
6:        1532  2021-12-01            1
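A per-customer row counter like this assumes the rows are sorted by date and that each customer has exactly one row per month. As a rough base R sketch of the same idea with an explicit sort first (the unsorted sample data is made up for illustration):

```r
# Made-up, deliberately unsorted data
df <- data.frame(
  customer_id = c("1531", "1530", "1530", "1531", "1530"),
  entry_month = as.Date(c("2021-12-01", "2021-11-01", "2021-10-01",
                          "2021-11-01", "2021-12-01"))
)

# Sort by customer, then by date, so the counter follows chronological order
df <- df[order(df$customer_id, df$entry_month), ]

# ave() applies seq_along within each customer_id: 1, 2, 3, ... per group
df$month_number <- ave(seq_along(df$customer_id), df$customer_id,
                       FUN = seq_along)

df$month_number
# 1 2 3 1 2
```

If a customer can have several rows in the same month, or months with no rows, a row counter will not match the calendar, and one of the date-based approaches is the safer choice.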