I have a dataframe like this:
customer= c('1530','1530','1530','1531','1531','1532')
month = c('2021-10-01','2021-11-01','2021-12-01','2021-11-01','2021-12-01','2021-12-01')
month_number = c(1,2,3,1,2,1)
df <- data.frame('customer_id'=customer, entry_month=month)
df
|   | customer_id | entry_month |
| - | ----------- | ----------- |
| 1 | 1530 | 2021-10-01 |
| 2 | 1530 | 2021-11-01 |
| 3 | 1530 | 2021-12-01 |
| 4 | 1531 | 2021-11-01 |
| 5 | 1531 | 2021-12-01 |
| 6 | 1532 | 2021-12-01 |
I need to create a column that indicates the number of the month since the customer joined. Here is my desired output:
new_df <- data.frame('customer_id'=customer, 'entry_month'=month, 'month_number'=month_number)
new_df
|   | customer_id | entry_month | month_number |
| - | ----------- | ----------- | ------------ |
| 1 | 1530 | 2021-10-01 | 1 |
| 2 | 1530 | 2021-11-01 | 2 |
| 3 | 1530 | 2021-12-01 | 3 |
| 4 | 1531 | 2021-11-01 | 1 |
| 5 | 1531 | 2021-12-01 | 2 |
| 6 | 1532 | 2021-12-01 | 1 |
CodePudding user response:
You can convert entry_month to Date format and then simply use first():
library(dplyr)
df %>%
  group_by(customer_id) %>%
  mutate(
    entry_month = as.Date(entry_month),
    # days since the customer's first entry, converted to approximate months
    nmonth = round(as.numeric(entry_month - first(entry_month)) / 30) + 1
  )
# A tibble: 6 x 3
# Groups: customer_id [3]
customer_id entry_month nmonth
<chr> <date> <dbl>
1 1530 2021-10-01 1
2 1530 2021-11-01 2
3 1530 2021-12-01 3
4 1531 2021-11-01 1
5 1531 2021-12-01 2
6 1532 2021-12-01 1
Note that this works only if entry_month is always the first day of a month and each customer's rows are sorted by date, since first() takes the first row in the group. Otherwise you will have to specify what exactly one month means. E.g. if the first entry is on 2021-10-20 and the second one is on 2021-11-10, what would be the desired outcome of nmonth?
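One possible answer to that question, sketched here as an assumption rather than as part of the answer above, is to count calendar months: any date in October is month 1, any date in November is month 2, and so on. This avoids dividing by 30, which is only an approximation of a month's length. The helper name month_index is hypothetical, and the sketch uses base R only.

```r
# Hypothetical helper (not from the answer above): calendar months elapsed
# since the earliest date in the vector, starting at 1.
month_index <- function(dates) {
  lt <- as.POSIXlt(as.Date(dates))
  ym <- lt$year * 12L + lt$mon   # year and month on one continuous scale
  as.integer(ym - min(ym) + 1L)
}

df <- data.frame(
  customer_id = c("1530", "1530", "1530", "1531", "1531", "1532"),
  entry_month = c("2021-10-01", "2021-11-01", "2021-12-01",
                  "2021-11-01", "2021-12-01", "2021-12-01")
)

# ave() passes the row indices of each customer_id group to the helper
df$month_number <- ave(
  seq_len(nrow(df)), df$customer_id,
  FUN = function(i) month_index(df$entry_month[i])
)
df
```

Under this definition the 2021-10-20 and 2021-11-10 entries would get 1 and 2, and the result does not depend on row order because it uses min() rather than first().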
CodePudding user response:
This takes the year-month part of the date and counts distinct values.
I extended the example to include a repeated month.
library(dplyr)
df %>%
  group_by(customer_id) %>%
  arrange(entry_month, .by_group = TRUE) %>%
  mutate(month_number = cumsum(
    !duplicated(strftime(entry_month, "%Y-%m")))) %>%
  ungroup()
# A tibble: 7 × 3
customer_id entry_month month_number
<chr> <chr> <int>
1 1530 2021-10-01 1
2 1530 2021-10-12 1
3 1530 2021-11-01 2
4 1530 2021-12-01 3
5 1531 2021-11-01 1
6 1531 2021-12-01 2
7 1532 2021-12-01 1
Data
df <- structure(list(customer_id = c("1530", "1530", "1530", "1530",
"1531", "1531", "1532"), entry_month = c("2021-10-01", "2021-10-12",
"2021-11-01", "2021-12-01", "2021-11-01", "2021-12-01", "2021-12-01"
)), row.names = c(NA, -7L), class = "data.frame")
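To see the counting idiom from this answer in isolation, here is a minimal sketch on a plain character vector. The values are illustrative and not taken from the question's data.

```r
# Year-month keys for a single customer, already sorted (illustrative values)
ym <- c("2021-10", "2021-10", "2021-11", "2022-01")

# !duplicated() is TRUE at the first occurrence of each key,
# so the running sum goes up by one whenever a new month appears
month_number <- cumsum(!duplicated(ym))
month_number
```

In this sketch the last entry is numbered 3 even though December is missing: the idiom counts distinct months seen so far, not calendar months elapsed, which may or may not be what is wanted.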
CodePudding user response:
Alternatively, you can use the data.table package:
library(data.table)
dt <- setDT(df)  # convert to a data.table by reference
dt[, entry_month := as.IDate(entry_month)]  # transform the column to IDate
# Number the rows within each customer; assumes one row per month, sorted by date
dt[, month_number := seq_len(.N), by = customer_id]
dt
Output:
customer_id entry_month month_number
1: 1530 2021-10-01 1
2: 1530 2021-11-01 2
3: 1530 2021-12-01 3
4: 1531 2021-11-01 1
5: 1531 2021-12-01 2
6: 1532 2021-12-01 1