I'm working with investment fund data taken from Morningstar, which provides them at share class level. For people who did not have exposure to finance/funds, no need to dive into detail, but mine is a panel data structured as following:
Fund ID | Sec ID | Net Assets | Return | Rating |
---|---|---|---|---|
A | A1 | 100 | 1% | 4 stars |
A | A2 | 200 | 1,2 % | 4 stars |
A | A3 | 150 | 0,5 % | 3 stars |
B | B1 | 50 | 1,1 % | 2 stars |
B | B2 | 120 | 0,75% | 3 stars |
C | C1 | 300 | 0,4% | 5 stars |
C | C2 | 500 | 0,55% | 4 stars |
What I need to achieve is to aggregate data at Fund level (Fund ID), so that the fund size will be the sum of the net assets of the different share classes (Sec ID). The return and the star rating at fund level will be the weighted average of both variables (star rating rounded). I'm using R and my dataset is made of over 8000 share classes therefore it's essential to get an easily scalable solution.
i.e. Fund A return would be: (0.01 * 100 0.012 * 200 0.005 * 150) / (100 200 150) = 0,92%
Fund B rating would be (2 * 50 3 * 120) / (50 120) = 2.70 rounded to 3
Any help on how to achieve such a result? How could I apply that to a panel data (with daily observation over 3 months)?
CodePudding user response:
library(data.table)
library(stringr)
df <- dput('A, A1, 100, "1%", "4 stars"
A, A2, 200, "1,2%", "4 stars"
A, A3, 150, "0,5%", "3 stars"
B, B1, 50, "1,1%", "2 stars"
B, B2, 120, "0,75%", "3 stars"
C, C1, 300, "0,4%", "5 stars"
C, C2, 500, "0,55%", "4 stars"') |>
textConnection() |>
read.csv(header = F) |>
`colnames<-`(c('Fund ID', 'Sec ID', 'Net Assets', 'Return', 'Rating'))
df <- split(df, df$`Fund ID`) |>
lapply(function(x) {
na <- x$`Net Assets` |>
as.numeric()
ret <- (x$Return |>
str_replace_all('%', '')|>
str_replace_all(',', '.') |>
as.numeric()) / 1e2
x$fund_return <- (sum(na * ret) / sum(na) * 1e2) |>
round(2) |>
sprintf(fmt = '%s%%') |>
str_replace_all('\\.', ',')
x
}) |>
rbindlist() |>
as.data.frame()
df <- split(df, df$`Fund ID`) |>
lapply(function(x) {
na <- x$`Net Assets` |>
as.numeric()
rating <- (x$Rating |>
str_replace_all('stars', '') |>
as.numeric())
x$calculated_rating <- (sum(na * rating) / sum(na)) |>
round()
x
}) |>
rbindlist() |>
as.data.frame()
df |> print()
Output:
Fund ID Sec ID Net Assets Return Rating fund_return calculated_rating
1 A A1 100 1% 4 stars 0,92% 4
2 A A2 200 1,2 % 4 stars 0,92% 4
3 A A3 150 0,5 % 3 stars 0,92% 4
4 B B1 50 1,1 % 2 stars 0,85% 3
5 B B2 120 0,75% 3 stars 0,85% 3
6 C C1 300 0,4% 5 stars 0,49% 4
7 C C2 500 0,55% 4 stars 0,49% 4
CodePudding user response:
Consider a split/apply approach using by
to run aggregate calculations by fund-level grouping:
# CLEAN NUMERIC COLUMNS
fund_df <- within(
fund_df, {
Rating_num <- as.integer(gsub(" stars", "", Rating))
Return_pct <- as.numeric(gsub("%", "", gsub(",", ".", Return, fixed=TRUE), fixed=TRUE)
) / 100
}
)
# SPLIT/APPLY TO CALCULATE GROUP LEVEL CALCULATIONS
fund_dfs <- by(
fund_df,
fund_df$`Fund ID`,
function(sub) within(
sub, {
fund_rating <- round(sum(`Net Assets` * Rating_num) / sum(`Net Assets`))
fund_return <- sum(`Net Assets` * Return_pct) / sum(`Net Assets`)
}
)
)
# COMBINE SPLITS TO SINGLE DATA FRAME
fund_df <- do.call(rbind, unname(fund_dfs))
fund_df
# Fund ID Sec ID Net Assets Return Rating Return_pct Rating_num fund_return fund_rating
# 1 A A1 100 1% 4 stars 0.0100 4 0.009222222 4
# 2 A A2 200 1,2% 4 stars 0.0120 4 0.009222222 4
# 3 A A3 150 0,5% 3 stars 0.0050 3 0.009222222 4
# 4 B B1 50 1,1% 2 stars 0.0110 2 0.008529412 3
# 5 B B2 120 0,75% 3 stars 0.0075 3 0.008529412 3
# 6 C C1 300 0,4% 5 stars 0.0040 5 0.004937500 4
# 7 C C2 500 0,55% 4 stars 0.0055 4 0.004937500 4