How to find the correlation for each year?

I'm trying to find the correlation for each year in a data frame using group_by and mutate. When I compute the correlation coefficient for two individual years separately, I get two different values. When I try to compute it for every year at once, the same value is shown for each year.

Here, the correlation coefficients are different.

gm52 <- gapminder %>% filter(year == 1952)
cor(gm52$lifeExp, gm52$gdpPercap, use = "complete.obs")
gm57 <- gapminder %>% filter(year == 1957)
cor(gm57$lifeExp, gm57$gdpPercap, use = "complete.obs")

When I try to calculate them all at once, the new data frame contains the same value for every year:

gmtest <- gapminder %>%
  group_by(year) %>%
  mutate(crltn = cor(gmtest$lifeExp, gmtest$gdpPercap, use = "complete.obs"))

CodePudding user response：

When you use dplyr::mutate, you add a new variable to the existing data frame, so the result keeps all of its rows. As the gapminder data has 1704 rows, the new column has 1704 values. Because you have grouped by year, cor() returns one value per year, and that value is recycled across all the observations in that year.
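
As a rough illustration of that recycling, here is a minimal sketch of a grouped mutate that refers to the columns by their bare names. The names gm_with_cor and crltn are only illustrative, and the native pipe assumes R 4.1 or later.

```r
library(gapminder)
library(dplyr)

# Grouped mutate with bare column names: cor() is evaluated once per year,
# and the single result is repeated on every row of that year.
gm_with_cor <- gapminder |>
  group_by(year) |>
  mutate(crltn = cor(lifeExp, gdpPercap, use = "complete.obs")) |>
  ungroup()

# The row count is unchanged by mutate.
nrow(gm_with_cor)

# Collapsing to unique (year, crltn) pairs leaves one row per year.
gm_with_cor |> distinct(year, crltn)
```

This keeps every country-year row, which is handy if the correlation is needed alongside the original data.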

Conversely, if you use dplyr::summarise, this reduces the length of each group to the length of the output of the function. In this case, the function is cor() and the output is one number, so this means one row per year.

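Additionally, with dplyr you do not need dollar notation such as gmtest$lifeExp, and using it bypasses the grouping: gmtest$lifeExp always refers to the entire column, whatever the current group is, which is why you got the same value for every year. Omit the name of the data frame and write lifeExp instead.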

library(gapminder)
library(dplyr)

gapminder |>
  group_by(year) |>
  summarise(
    crltn = cor(lifeExp, gdpPercap, use = "complete.obs")
  )
# # A tibble: 12 x 2
#     year crltn
#    <int> <dbl>
#  1  1952 0.278
#  2  1957 0.304
#  3  1962 0.383
#  4  1967 0.480
#  5  1972 0.460
#  6  1977 0.620
#  7  1982 0.723
#  8  1987 0.750
#  9  1992 0.705
# 10  1997 0.704
# 11  2002 0.682
# 12  2007 0.679
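
To see the effect of dollar notation on grouping, the following sketch compares the two spellings side by side. The names with_dollar and with_bare are hypothetical, chosen only for this comparison.

```r
library(gapminder)
library(dplyr)

# `gapminder$lifeExp` is the full column of the original data frame, so every
# group computes the same overall correlation and the grouping has no effect.
with_dollar <- gapminder |>
  group_by(year) |>
  summarise(crltn = cor(gapminder$lifeExp, gapminder$gdpPercap))

# Bare column names are looked up within the current group.
with_bare <- gapminder |>
  group_by(year) |>
  summarise(crltn = cor(lifeExp, gdpPercap))

n_distinct(with_dollar$crltn)  # one value, repeated for all years
n_distinct(with_bare$crltn)    # a separate value for each year
```

The repeated value in with_dollar is simply the correlation over the whole data set, ignoring year.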

CodePudding user response：

I'm not sure if I'm missing it in the question, but it seems your main objective is simply to get correlations by year. You don't really have to use mutate or summarise if you use the correlation package (though the other answer here is certainly a good option). As a demonstration, I've computed correlations on the same variables as in the other answer:

#### Load Libraries ####
library(gapminder)
library(tidyverse)
library(correlation)

#### Single Command Correlation ####
gapminder %>% 
  group_by(year) %>% 
  select(lifeExp,
         gdpPercap) %>% 
  correlation()

This gives you similar output:

# Correlation Matrix (pearson-method)

Group | Parameter1 | Parameter2 |    r |       95% CI | t(140) |         p
--------------------------------------------------------------------------
1952  |    lifeExp |  gdpPercap | 0.28 | [0.12, 0.42] |   3.42 | < .001***
1957  |    lifeExp |  gdpPercap | 0.30 | [0.15, 0.45] |   3.77 | < .001***
1962  |    lifeExp |  gdpPercap | 0.38 | [0.23, 0.52] |   4.91 | < .001***
1967  |    lifeExp |  gdpPercap | 0.48 | [0.34, 0.60] |   6.48 | < .001***
1972  |    lifeExp |  gdpPercap | 0.46 | [0.32, 0.58] |   6.12 | < .001***
1977  |    lifeExp |  gdpPercap | 0.62 | [0.51, 0.71] |   9.35 | < .001***
1982  |    lifeExp |  gdpPercap | 0.72 | [0.63, 0.79] |  12.37 | < .001***
1987  |    lifeExp |  gdpPercap | 0.75 | [0.67, 0.81] |  13.41 | < .001***
1992  |    lifeExp |  gdpPercap | 0.70 | [0.61, 0.78] |  11.75 | < .001***
1997  |    lifeExp |  gdpPercap | 0.70 | [0.61, 0.78] |  11.72 | < .001***
2002  |    lifeExp |  gdpPercap | 0.68 | [0.58, 0.76] |  11.03 | < .001***
2007  |    lifeExp |  gdpPercap | 0.68 | [0.58, 0.76] |  10.93 | < .001***

p-value adjustment method: Holm (1979)
Observations: 142