I'm trying to create an algorithm in R that can iterate through a long form data frame and for each 'value', look through the full data frame, create a subset based on the 'Date' column for the past fiscal year, and then get the % of values for that variable that are below it. And then get the % below specific to that industry as well.
My data is currently structured in this format:
Date | Industry | variable | value |
---|---|---|---|
2022-02-01 | Industry1 | variable1 | 43.2 |
2022-04-01 | Industry1 | variable2 | 17.8 |
2022-06-01 | Industry2 | variable3 | 22.5 |
2021-12-01 | Industry3 | variable1 | 98.3 |
My code is currently below. It works fine and gets me the desired values but it runs very slowly and would take hours to run through the total 500k rows. I'm wondering if there is maybe some sorting algorithm that could produce a quicker result?
Any help would be very appreciated. Thank you!
for (row in 1:nrow(df)) {
checkDate <- df[row, "Date"]
industry <- df[row, "Industry"]
variable <- df[row, "variable"]
val <- as.numeric(df[row, "value"])
filtered_df <- df[which(df$variable == variable), ]
fiscal_year <- filtered_df[which(filtered_df$Date <= checkDate &
filtered_df$Date >= (checkDate - 366)), ]
industry_df <- fiscal_year[which(fiscal_year$Industry == industry), ]
df[row, "Total"] <- length(which(fiscal_year$value < val))/
length(fiscal_year$value)
df[row, "Industry"] <- length(which(industry_df$value < val))/
length(industry_df$value)
}
CodePudding user response:
I don't have R handy, and it will take a lot of work, but I'll tell you the strategy.
First, sort by variable
, then checkDate
.
Now start with 3 variables, row
, row_min
and row_max
. And also take something like rbst (it is the right data structure) and modify it to keep track of counts of the size of its left and right tree structure in addition to the tree structures. From this extra data, you an create a O(log(n))
lookup to count how many things in your tree
are less than or equal to a particular amount.
And then in pseudocode:
set row, row_min, and row_max to the start of the data frame
create your tree, sorted by value, with only row in it
while True:
while row_max.variable == row.variable
and row_max.next() exists
and row_max.next().variable == row.variable
and row_max.next().checkDate == row.checkDate:
insert row_max into tree
row_max = row_max.next()
while row_min.variable != row.variable:
delete row_min from tree
row_min = row_min.next()
while row_min.checkDate < row.checkDate - 266:
delete row_min from tree
row_min = row_min.next()
row.total = (count of things in tree below row.value) / (position of row_max - position of row_min 1)
if row.next() exists:
row = row.next()
else
break # EXIT LOOP HERE!
You'll then need to sort by industry
, variable
, and checkDate
, repeat the logic (except with checks on industry
in addition to variable
), and calculate your Industry
calculation.
This is a lot of work but will take the calculation from O(n^2)
to O(n log(n))
.
CodePudding user response:
I don't have time to provide a total solution. But you may want to consider using the ecdf
(empirical cumulative distribution function) to generate the percentile values. Here is an example using mtcars
with mpg
as the target metric:
cars <- mtcars
Fn <- ecdf(cars$mpg)
cars$ecdf <- Fn(cars$mpg)
head(cars)
mpg cyl disp hp drat wt qsec vs am gear carb ecdf
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 0.62500
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 0.62500
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 0.78125
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 0.68750
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 0.46875
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 0.43750
The ecdf
column in the dataframe is the fraction of mpg
values that are below the row mpg
value.
Your complete solution would group_by
extracted year
AND variable
and also group_by
year
AND Industry
AND variable
, applying the ecdf
function to the value
column for each cluster of groups.