Computing correlations between variables in 2 dataframes-CodePudding

Am trying to compute the correlations of the below countries, with USA. I have a relatively big dataset with 80 variables & 3000 observations in my first df as below, so am trying to use R to automate this instead of using excel.

I am trying to compute correlations for the countries in the first df (i.e. Germany, Italy, Japan and more) with USA in the 2nd df. So it should go Germany - USA, Italy - USA, Japan - USA and so on.

Not too sure how should I begin - should I loop every column in the first table to correlate with USA in the 2nd? Help is much appreciated.

Thanks!

df1

Date	Germany	Italy	Japan	More countries...
01-01-2020	1000	200	2304	More numbers...
01-02-2020	2000	389	2098	More numbers...

and on and on

df2

Date	USA
01-01-2020	500
01-02-2020	600

and on and on

CodePudding user response：

You could use this approach:

library(dplyr);library(magrittr)
countries = c("Germany", "Italy", "Japan")
left_join(df1, df2) %>% summarise(across(countries, ~cor(., USA)))

left_join merges df1 and df2 together so that the dates always match up with one another
summarise allows you to perform column-wise operations
across tells you which columns you want to make a correlation with USA
~cor(., USA) says take each country and perform the correlation with USA

Germany  Italy  Japan
    <dbl>  <dbl>  <dbl>
1  -0.393 -0.147 -0.214

Thank you Damien Georges for the data.

CodePudding user response：

Something like that should do the trick:

library(dplyr)

df1 <- 
  tibble(
    date = 2001:2010,
    Germany = runif(10),
    Italy = runif(10),
    Japan = runif(10)
  )

df2 <-
  tibble(
    date = 2001:2010,
    USA = runif(10)
  )

df.cor <-
  df1 %>%
  summarise(across(-one_of('date'), ~ cor(.x, df2$USA)))

df.cor

Note: You have to be sure that dates are consistent between df1 and df2. You can use join function (e.g. left_join) to ensure this

CodePudding user response：

Here are two base R solutions, depending on the final format you want.
Both with the new pipe operator, introduced in R 4.1.0.

df2[-1] |> cor(df1[-1]) |> as.data.frame()
#      Germany     Italy     Japan
#USA 0.3161338 0.5483885 0.1725733

df1[-1] |> cor(df2[-1]) |> as.data.frame()
#              USA
#Germany 0.3161338
#Italy   0.5483885
#Japan   0.1725733

More traditional but equivalent versions:

as.data.frame(cor(df2[-1], df1[-1]))
as.data.frame(cor(df1[-1], df2[-1]))

Data

Data creation code borrowed from Damien Georges.

set.seed(2021)
df1 <- 
  data.frame(
    date = 2001:2010,
    Germany = runif(10),
    Italy = runif(10),
    Japan = runif(10)
  )

df2 <-
  data.frame(
    date = 2001:2010,
    USA = runif(10)
  )