Home > database >  Computing correlations between variables in 2 dataframes
Computing correlations between variables in 2 dataframes

Time:10-14

Am trying to compute the correlations of the below countries, with USA. I have a relatively big dataset with 80 variables & 3000 observations in my first df as below, so am trying to use R to automate this instead of using excel.

I am trying to compute correlations for the countries in the first df (i.e. Germany, Italy, Japan and more) with USA in the 2nd df. So it should go Germany - USA, Italy - USA, Japan - USA and so on.

Not too sure how should I begin - should I loop every column in the first table to correlate with USA in the 2nd? Help is much appreciated.

Thanks!

df1

Date Germany Italy Japan More countries...
01-01-2020 1000 200 2304 More numbers...
01-02-2020 2000 389 2098 More numbers...

and on and on

df2

Date USA
01-01-2020 500
01-02-2020 600

and on and on

CodePudding user response:

You could use this approach:

library(dplyr);library(magrittr)
countries = c("Germany", "Italy", "Japan")
left_join(df1, df2) %>% summarise(across(countries, ~cor(., USA)))
  1. left_join merges df1 and df2 together so that the dates always match up with one another
  2. summarise allows you to perform column-wise operations
  3. across tells you which columns you want to make a correlation with USA
  4. ~cor(., USA) says take each country and perform the correlation with USA
Germany  Italy  Japan
    <dbl>  <dbl>  <dbl>
1  -0.393 -0.147 -0.214

Thank you Damien Georges for the data.

CodePudding user response:

Something like that should do the trick:

library(dplyr)

df1 <- 
  tibble(
    date = 2001:2010,
    Germany = runif(10),
    Italy = runif(10),
    Japan = runif(10)
  )

df2 <-
  tibble(
    date = 2001:2010,
    USA = runif(10)
  )

df.cor <-
  df1 %>%
  summarise(across(-one_of('date'), ~ cor(.x, df2$USA)))

df.cor

Note: You have to be sure that dates are consistent between df1 and df2. You can use join function (e.g. left_join) to ensure this

CodePudding user response:

Here are two base R solutions, depending on the final format you want.
Both with the new pipe operator, introduced in R 4.1.0.

df2[-1] |> cor(df1[-1]) |> as.data.frame()
#      Germany     Italy     Japan
#USA 0.3161338 0.5483885 0.1725733

df1[-1] |> cor(df2[-1]) |> as.data.frame()
#              USA
#Germany 0.3161338
#Italy   0.5483885
#Japan   0.1725733

More traditional but equivalent versions:

as.data.frame(cor(df2[-1], df1[-1]))
as.data.frame(cor(df1[-1], df2[-1]))

Data

Data creation code borrowed from Damien Georges.

set.seed(2021)
df1 <- 
  data.frame(
    date = 2001:2010,
    Germany = runif(10),
    Italy = runif(10),
    Japan = runif(10)
  )

df2 <-
  data.frame(
    date = 2001:2010,
    USA = runif(10)
  )
  •  Tags:  
  • r
  • Related