I have a number of excel spreadsheets I'm iterating through with payments on a given date and an associated number of months of service for the payment.
e.g.
Product | Cost | License Date Start | License length in months | Monthly cost |
---|---|---|---|---|
Product A | 3000 | January 2022 | 3 | 1000 |
Product B | 2400 | March 2022 | 4 | 600 |
Product B | 2400 | Feb 2022 | 3 | 800 |
Product A | 2000 | March 2022 | 2 | 1000 |
What I would like to do is create a new dataframe, shaped around the months, with the broken down individual and total monthly cost of each product, based on the length of the license.
For example, in the table above, the cost of the first instance Product A is 3000 and runs for 3 months, making it 1000/month and running through January, February and March. For the second instance of Product A, it is again 1000/month but runs through March and April, so there is overlap, with March have a total cost of Product A of 2000.
In the end, my outcome should look like this:
Date | Product A cost | Product B cost | Product C cost | Total cost |
---|---|---|---|---|
January 2022 | 1000 | 0 | 0 | 1000 |
February 2022 | 1000 | 800 | 0 | 1800 |
March 2022 | 2000 | 2400 | 0 | 4400 |
April 2022 | 1000 | 2400 | 0 | 3400 |
May 2022 | 1000 | 600 | 0 | 600 |
June 2022 | 1000 | 600 | 0 | 600 |
I am struggling to find the best way to iterate through the original data and generate the end result. My general approach is to use apply
to iterate through the original dataframe, generating rows based on the number of months, start date, and monthly cost, before then attempting to reshape into relevant columns, but I am having trouble getting apply to return and am concerned that this isn't the most efficient way to do this.
Any help much appreciated.
CodePudding user response:
Using your input df
as a starting point, where I changed the License Date Start
into the corresponding month number, you can uncount the occurences by License length in months
.
input_df <- data.frame(
Product = c("Product A", "Product B", "Product B", "Product A"),
month_start = c(1, 3, 2, 3),
License_lenght = c(3, 4, 3, 2),
Monthly = c(1000, 600, 800, 1000)
)
You then want to keep track of every row, as one product can have multiple starting months. In this example I used row_number()
output_df <- input_df %>%
mutate(rn=row_number()) %>%
group_by(Product, rn) %>%
uncount(License_lenght) %>%
mutate(month_active = row_number() month_start - 1) %>%
group_by(Product, month_active) %>%
summarize(Product_monthly_cost = sum(Monthly)) %>%
group_by(month_active) %>%
mutate(Total_cost = sum(Product_monthly_cost)) %>%
pivot_wider(names_from = Product, values_from = Product_monthly_cost) %>%
replace(is.na(.), 0)
I uncount per product type and row number rn
. Then I define every month in which the license is active, and sum the monthly cost per product and active month. Then group per active month to determine the total monthly cost. Finally I pivot_wider
per product and active_month
just like the desired output dataframe you posted and replace the na's with 0.
The result is
> output_df
# A tibble: 6 × 4
month_active Total_cost `Product A` `Product B`
<dbl> <dbl> <dbl> <dbl>
1 1 1000 1000 0
2 2 1800 1000 800
3 3 3400 2000 1400
4 4 2400 1000 1400
5 5 600 0 600
6 6 600 0 600
CodePudding user response:
I think you have to be a little bit careful with your calculations regarding your dates. In your example the start and end dates are all in the same year, but if your starting month is December and the license lasts more than a month, then you have to pay attention to the calcuation of the month and year. For this you can use the lubridate
-package. I added one row to your example for December 2021 to demonstrate it:
library(tidyverse)
library(lubridate)
df <- read.table(text = "Product Cost License Date Start License length in months Monthly cost
Product A 3000 January 2022 3 1000
Product B 2400 March 2022 4 600
Product B 2400 Feb 2022 3 800
Product A 2000 March 2022 2 1000
Product C 2000 December 2021 2 1000", sep = "\t", header = TRUE)
df.result <- df %>%
mutate(id = row_number(), Date = my(License.Date.Start)) %>%
group_by(id, Product, Monthly.cost) %>%
summarise(Date = Date %m % months((1:License.length.in.months) - 1)) %>%
pivot_wider(id_cols = Date, names_from = Product, values_from = Monthly.cost, values_fn = sum, values_fill = 0) %>%
arrange(Date) %>%
mutate(Total = rowSums(select(., contains("Product"))), Date = format(Date, "%B %Y"))
df.result
#> # A tibble: 7 x 5
#> Date `Product A` `Product B` `Product C` Total
#> <chr> <int> <int> <int> <dbl>
#> 1 December 2021 0 0 1000 1000
#> 2 January 2022 1000 0 1000 2000
#> 3 February 2022 1000 800 0 1800
#> 4 March 2022 2000 1400 0 3400
#> 5 April 2022 1000 1400 0 2400
#> 6 May 2022 0 600 0 600
#> 7 June 2022 0 600 0 600
Created on 2022-10-17 by the reprex package (v2.0.1)