Home > Blockchain >  Best way to iterate through dataframe, calculate monthly value, generate new dataframe?
Best way to iterate through dataframe, calculate monthly value, generate new dataframe?

Time:10-18

I have a number of excel spreadsheets I'm iterating through with payments on a given date and an associated number of months of service for the payment.

e.g.

Product Cost License Date Start License length in months Monthly cost
Product A 3000 January 2022 3 1000
Product B 2400 March 2022 4 600
Product B 2400 Feb 2022 3 800
Product A 2000 March 2022 2 1000

What I would like to do is create a new dataframe, shaped around the months, with the broken down individual and total monthly cost of each product, based on the length of the license.
For example, in the table above, the cost of the first instance Product A is 3000 and runs for 3 months, making it 1000/month and running through January, February and March. For the second instance of Product A, it is again 1000/month but runs through March and April, so there is overlap, with March have a total cost of Product A of 2000.
In the end, my outcome should look like this:

Date Product A cost Product B cost Product C cost Total cost
January 2022 1000 0 0 1000
February 2022 1000 800 0 1800
March 2022 2000 2400 0 4400
April 2022 1000 2400 0 3400
May 2022 1000 600 0 600
June 2022 1000 600 0 600

I am struggling to find the best way to iterate through the original data and generate the end result. My general approach is to use apply to iterate through the original dataframe, generating rows based on the number of months, start date, and monthly cost, before then attempting to reshape into relevant columns, but I am having trouble getting apply to return and am concerned that this isn't the most efficient way to do this. Any help much appreciated.

CodePudding user response:

Using your input df as a starting point, where I changed the License Date Start into the corresponding month number, you can uncount the occurences by License length in months.

input_df <- data.frame(
  Product = c("Product A", "Product B", "Product B", "Product A"),
  month_start = c(1, 3, 2, 3), 
  License_lenght = c(3, 4, 3, 2),
  Monthly = c(1000, 600, 800, 1000)
)

You then want to keep track of every row, as one product can have multiple starting months. In this example I used row_number()

output_df <- input_df %>%
  mutate(rn=row_number()) %>%
  group_by(Product, rn) %>%
  uncount(License_lenght) %>%
  mutate(month_active = row_number()   month_start - 1) %>%
  group_by(Product, month_active) %>%
  summarize(Product_monthly_cost = sum(Monthly)) %>%
  group_by(month_active) %>%
  mutate(Total_cost = sum(Product_monthly_cost)) %>%
  pivot_wider(names_from = Product, values_from = Product_monthly_cost) %>%
  replace(is.na(.), 0)

I uncount per product type and row number rn. Then I define every month in which the license is active, and sum the monthly cost per product and active month. Then group per active month to determine the total monthly cost. Finally I pivot_wider per product and active_month just like the desired output dataframe you posted and replace the na's with 0.

The result is

> output_df
# A tibble: 6 × 4
  month_active Total_cost `Product A` `Product B`
         <dbl>      <dbl>       <dbl>       <dbl>
1            1       1000        1000           0
2            2       1800        1000         800
3            3       3400        2000        1400
4            4       2400        1000        1400
5            5        600           0         600
6            6        600           0         600

CodePudding user response:

I think you have to be a little bit careful with your calculations regarding your dates. In your example the start and end dates are all in the same year, but if your starting month is December and the license lasts more than a month, then you have to pay attention to the calcuation of the month and year. For this you can use the lubridate-package. I added one row to your example for December 2021 to demonstrate it:

library(tidyverse)
library(lubridate)

df <- read.table(text = "Product    Cost    License Date Start  License length in months    Monthly cost
Product A   3000    January 2022    3   1000
Product B   2400    March 2022  4   600
Product B   2400    Feb 2022    3   800
Product A   2000    March 2022  2   1000
Product C   2000    December 2021   2   1000", sep = "\t", header = TRUE)

df.result <- df %>% 
  mutate(id = row_number(), Date = my(License.Date.Start)) %>%
  group_by(id, Product, Monthly.cost) %>%
  summarise(Date = Date %m % months((1:License.length.in.months) - 1)) %>% 
  pivot_wider(id_cols = Date, names_from = Product, values_from = Monthly.cost, values_fn = sum, values_fill = 0) %>%
  arrange(Date) %>%
  mutate(Total = rowSums(select(., contains("Product"))), Date = format(Date, "%B %Y"))

df.result
#> # A tibble: 7 x 5
#>   Date          `Product A` `Product B` `Product C` Total
#>   <chr>               <int>       <int>       <int> <dbl>
#> 1 December 2021           0           0        1000  1000
#> 2 January 2022          1000           0        1000  2000
#> 3 February 2022         1000         800           0  1800
#> 4 March 2022            2000        1400           0  3400
#> 5 April 2022           1000        1400           0  2400
#> 6 May 2022                0         600           0   600
#> 7 June 2022               0         600           0   600

Created on 2022-10-17 by the reprex package (v2.0.1)

  • Related