Home > Enterprise >  Computing correlations between variables in 2 dataframes
Computing correlations between variables in 2 dataframes

Time:10-13

Am trying to compute the correlations of the below countries, with USA. I have a relatively big dataset with 80 variables & 3000 observations in my first df as below, so am trying to use R to automate this instead of using excel.

I am trying to compute correlations for the countries in the first df (i.e. Germany, Italy, Japan and more) with USA in the 2nd df. So it should go Germany - USA, Italy - USA, Japan - USA and so on.

Not too sure how should I begin - should I loop every column in the first table to correlate with USA in the 2nd? Help is much appreciated.

Thanks!

df1

Date Germany Italy Japan More countries...
01-01-2020 1000 200 2304 More numbers...
01-02-2020 2000 389 2098 More numbers...

and on and on

df2

Date USA
01-01-2020 500
01-02-2020 600

and on and on

CodePudding user response:

Something like that should do the trick:

df1 <- 
  tibble(
    date = 2001:2010,
    Germany = runif(10),
    Italy = runif(10),
    Japan = runif(10)
  )

df2 <-
  tibble(
    date = 2001:2010,
    USA = runif(10)
  )

df.cor <-
  df1 %>%
  summarise(across(-one_of('date'), ~ cor(.x, df2$USA)))

df.cor

Note: You have to be sure that dates are consistent between df1 and df2. You can use join function (e.g. left_join) to ensure this

CodePudding user response:

You could use this approach:

countries = c("Germany", "Italy", "Japan")
left_join(df1, df2) %>% summarise(across(countries, ~cor(., USA)))
  1. left_join merges df1 and df2 together so that the dates always match up with one another
  2. summarise allows you to perform column-wise operations
  3. across tells you which columns you want to make a correlation with USA
  4. ~cor(., USA) says take each country and perform the correlation with USA
Germany  Italy  Japan
    <dbl>  <dbl>  <dbl>
1  -0.393 -0.147 -0.214

Thank you Damien Georges for the data.

  •  Tags:  
  • r
  • Related