Calculating CAGR with differing time-based data. Issue with variable column referencing in R-CodePudding

I have a table of Calpers Private Equity Fund performance from several years. I cleaned and joined all the data into a large table with 186 entries for individual fund investments. Some of these funds have data for 5 yrs, most for 4 or less. I would like to calculate the CAGR for each fund using the earliest value and the latest value in the formula:

CAGR= Latest/First^(1/n)-1 ...

The columns with the data are named: 2017,2018,2019,2020,2021, so the formula in R will look something like this: (calper is the table with all the data ... one fund per row)

idx<- which(startsWith(names(calperMV),"2")) # locate columns with data needed for CAGR calc
idx <- rev(idx) # match to NCOL_NA order ...

the values here are (6,5,4,3,2) ... which are the column numbers for 2021-2020-2019-2018-2017.

the indx was formed by counting the number of NA in each row ... all the NA are left to right, so the totals here should be a reference to the idx and thus the correct columns.

I use the !!sym(as.String()) with name()[idx[indx]] to pull out the column names symbolically

calperMV %>% rowwise() %>% 
mutate(CAGR=`2021`/!!sym((colnames(.)[idx[indx]])^(1/(5-indx))-1))))

Problem is that the referencing either does not work correctly or gets this error:

"Error in local_error_context(dots = dots, .index = i, mask = mask) : promise already under evaluation: recursive default argument reference or earlier problems?"

I've tried creating test code which shows the addressing is working:

calper %>% rowwise() %>% mutate(test = (names(.)[idx[indx]]),
                                   test1= !!sym(as.String(names(.)[idx[1]])),
                                   test2= !!sym(as.String(names(.)[idx[2]])),
                                   test3= !!sym(as.String(names(.)[idx[3]])),
                                   test4= !!sym(as.String(names(.)[idx[4]])),
                                   test5= !!sym(as.String(names(.)[idx[5]])))

But when I do the full CAGR calc I get that recursive error. Here'a tibble of the test data for reference:

Input data:

Security Name   2017    2018    2019    2020    2021                  NA_cols   indx
ASIA ALT         NA      NA      NA      6,256,876.00    7,687,037.00   3       2
ASIA ALT     NA      NA      NA      32,549,704.00   34,813,844.00  3   2
AVATAR NA    NA      NA      NA      700,088.00      -      3   2
AVENUE FUND VI (A)   NA      NA      NA      10,561,674.00   19,145,496.00  3   2
BDC III C    NA      48,098,429.00   85,808,280.00   100,933,699.00      146,420,669.00     1   4
BIRCH HILL   NA      NA      NA      6,488,941.00    9,348,941.00   3   2
BLACKSTONE   NA      NA      NA      4,011,072.00    2,406,075.00   3   2
BLACKSTONE  IV   NA      NA      NA      4,923,625.00    3,101,081.00   3   2
BLACKSTONE  V    NA      NA      NA      18,456,472.00   17,796,711.00  3   2
BLACKSTONE VI    NA      NA      NA      245,269,656.00      310,576,064.00     3   2
BLACKSTONE VII   NA      NA      NA      465,415,036.00      607,172,062.00     3   2

Results: The indexing selects the proper String and also selects the proper # from the column ... but won't do when I operate with the selected variable:

selYR    test1   test2   test3   test4   test5 
2020     7,687,037.00    6,256,876.00    NA      NA      NA 
2020     34,813,844.00   32,549,704.00   NA      NA      NA 
2020     -       700,088.00      NA      NA      NA 
2020     19,145,496.00   10,561,674.00   NA      NA      NA 
2018     146,420,669.00      100,933,699.00      85,808,280.00   48,098,429.00   NA 
2020     9,348,941.00    6,488,941.00    NA      NA      NA 
2020     2,406,075.00    4,011,072.00    NA      NA      NA 
2020     3,101,081.00    4,923,625.00    NA      NA      NA 
2020     17,796,711.00   18,456,472.00   NA      NA      NA 
2020     310,576,064.00      245,269,656.00      NA      NA      NA 
2020     607,172,062.00      465,415,036.00      NA      NA      NA

(Sorry ... I don't know how to put these into proper columns :( )

CodePudding user response：

I never learned all those fancy tidystuff techniques. Here's a base R approach:

Firstand second: Use read.delim to bring in tab data and your data has (yeccch) commas in the numbers

 calpDat <- read.delim(text=calpTab)
 calpDat[2:6] <- lapply(calpDat[2:6], function(x) as.numeric(gsub("[,]", "",x)))
Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In FUN(X[[i]], ...) : NAs introduced by coercion
3: In FUN(X[[i]], ...) : NAs introduced by coercion
4: In FUN(X[[i]], ...) : NAs introduced by coercion

Now that the data is in, you can count the number of valid numbers and then calculate the CAGR for each row using apply on the numeric columns in a rowwise fashion:

calpDat$CAGR <- apply(calpDat[2:6], 1, function(rw) {n <- length(na.omit(rw));
                              (rw[5]/rw[6-n])^(1/n) -1})
 calpDat
#----------------
        Security.Name X2017    X2018    X2019     X2020     X2021 NA_cols indx        CAGR
1       ASIA ALT         NA       NA       NA   6256876   7687037       3    2  0.10841071
2           ASIA ALT     NA       NA       NA  32549704  34813844       3    2  0.03419508
3           AVATAR NA    NA       NA       NA    700088        NA       3    2          NA
4  AVENUE FUND VI (A)    NA       NA       NA  10561674  19145496       3    2  0.34637777
5           BDC III C    NA 48098429 85808280 100933699 146420669       1    4  0.32089372
6          BIRCH HILL    NA       NA       NA   6488941   9348941       3    2  0.20031241
7         BLACKSTONE     NA       NA       NA   4011072   2406075       3    2 -0.22549478
8      BLACKSTONE  IV    NA       NA       NA   4923625   3101081       3    2 -0.20637732
9       BLACKSTONE  V    NA       NA       NA  18456472  17796711       3    2 -0.01803608
10      BLACKSTONE VI    NA       NA       NA 245269656 310576064       3    2  0.12528383
11     BLACKSTONE VII    NA       NA       NA 465415036 607172062       3    2  0.14218298

Problems remaining ... funds that did not have a value in the most recent year; funds that might have had discontinuous reporting. You need to say how these would be handled and provide example data if you want tested solutions.