Home > Net >  R: Group by and Apply a general function to two columns
R: Group by and Apply a general function to two columns

Time:02-19

Hi I'd like to groupby two dataframe columns, and apply a function to aother two dataframe columns. For e.g.,

ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,2,4,6,9,5)
vol <- c(3,5,1,6,2,3)
dat <- data.frame(ticker,date,ret,vol)

For each ticker and each date, I'd like to calculate its PIN.

Now, to avoid further confusion, perhaps it helps to just speak out the actual function. YZ is a function in the InfoTrad package, and YZ only accepts a dataframe with two columns. It uses some optimisation tool and returns an estimated PIN.

install.packages(InfoTrad)
library(InfoTrad)
get_pin_yz <- function(data) {
  return(YZ(data[ ,c('volume_krw_buy', 'volume_krw_sell')])[['PIN']])
}

I know how to do this in R using for loop. But for loop is very computationally costly, and it might take weeks to finish running my large dataset. Thus, I would like to ask how to do this using groupby.

# output format is wide wrt long format as "dat"
dat_w <- data.frame(ticker = NA, date = NA, PIN = NA)
for (j in c("A", "B")){
  
  for (k in c(1:2)){
    
    subset <- dat %>% subset((ticker == j & date == k), select = c('ret', "vol"))
    new_row <- data.frame(ticker = j, date = k, PIN = YZ(subset)$PIN)
    dat_w <- rbind(dat_w, new_row)
  }
}
dat_w <- dat_w[-1, ]
dat_w

Don't know if this can help you help me -- I know how to do this in python: I just write a function and run df.groupby(['ticker','date']).apply(function).

Finally, the wanted dataframe is:

ticker <- c('A','A','B','B')
date <- c(1,2,1,2)
PIN <- c(1.05e-17,2.81e-09,1.12e-08,5.39e-09)
data.frame(ticker,date,PIN)

Could somebody help out, please?

Thank you!

Best,

Darcy


Previous stuff (Feel free to ignore) Previously, I wrote this: My function is:

get_rv <- function(data) {
  return(data[['vol']]   data[['ret']])
}

What I want is:

ticker_wanted <- c('A','A', 'B', 'B')
date_wanted <- c(1,2,1,2)
rv_wanted <- c(7,5,10,11)
df_wanted <-data.frame(ticker_wanted,date_wanted,rv_wanted)

But this is not literally what my actual function is. The vol ret is just an example. I'm more interested in the more general case: how to groupby and apply a general function to two or more dataframes. I use the vol ret just because I didn't want to bother others by asking them to install some potentially irrelevant package on their PC.

CodePudding user response:

you could just do this? (with summarise as an example of your function):

ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,-2,4,6,9,-5)
vol <- c(3,5,1,6,2,3)
df <- data.frame(ticker,date,ret,vol)

df_wanted <- get_rv(df)

get_rv <- function(data){
  result <- data %>%
    group_by(ticker,date) %>%
    summarise(rv =sum(ret)   sum(vol)) %>%
    as.data.frame()
  names(result) <- c('ticker_wanted', 'date_wanted', 'rv_wanted')
  return(result)
}    

CodePudding user response:

Update based on real-life example:

You can do a direct approach like this:

library(tidyverse)
library(InfoTrad)
dat %>%
  group_by(ticker, date) %>%
  summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)

# A tibble: 4 x 3
# Groups:   ticker [2]
  ticker  date      PIN
  <chr>  <dbl>    <dbl>
1 A          1 1.05e-17
2 A          2 1.56e- 1
3 B          1 1.12e- 8
4 B          2 7.07e- 9

The difficulty here was that the YZ function only accepts true data frames, not tibbles and that it returns several values, not just PIN.

You could theoretically wrap this up into your own function and then run your own function like I‘ve shown in the example below, but maybe this way already does the trick.

I also don‘t expect this to run much faster than a for loop. It seems that this YZ function has some more-than-linear runtime, so passing larger amount of data will still take some time. You can try to start with a small set of data and then repeat it by increasing the size of your data with a factor of maybe 10 and then check how fast it runs.


In your example, you can do:

my_function <- function(data) {
  data %>%
    summarize(rv = sum(ret, vol))
}

library(tidyverse)
df %>%
  group_by(ticker, date) %>%
  my_function()

# A tibble: 4 x 3
# Groups:   ticker [2]
  ticker  date    rv
  <chr>  <dbl> <dbl>
1 A          1     7
2 A          2     5
3 B          1    10
4 B          2    11

But as mentioned in my comment, I‘m not sure if this general example would help in your real-life use case.

Might also be that you don‘t need to create your own function because built-in functions already exist. Like in the example, you sre better off with directly summarizing instead of wrapping it into a function.

  • Related