I have a ggplot for a logarithmic relationship between the variables GROWTH_RATE and TENURE:
pdata %>%
  ggplot(aes(x = log(TENURE), y = GROWTH_RATE)) +
  geom_point(color = 'gray', alpha = 0.3) +
  geom_smooth(method = 'lm', formula = 'y ~ x')
But the geom_smooth appears to fit better with:
pdata %>%
  ggplot(aes(x = log(TENURE), y = GROWTH_RATE)) +
  geom_point(color = 'gray', alpha = 0.3) +
  geom_smooth(method = 'lm', formula = 'y ~ log(x)')
Which plot is correct? Which plot shows a smooth fit line based on a linear model with formula y ~ log(TENURE)?
Answer:
It looks like your underlying growth rate varies with the log of the log of tenure. Here's some sample data with that "log of log" relationship:
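library(tidyverse)  # provides tibble(), %>% and ggplot2

# Simulated data: growth rate follows the log of the log of tenure, with noise
tibble(TENURE = runif(1E4, min = 7, max = 1000),
       GROWTH_RATE = rnorm(1E4, mean = 1, sd = 0.1) * log(log(TENURE))) %>%
  ggplot(aes(log(TENURE), GROWTH_RATE)) +
  geom_point(alpha = 0.3, color = "gray50") +
  geom_smooth(method = 'lm', formula = 'y ~ x')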
Plotting growth against the log results in a loose fit like your first one. Note that lm is using the transformed values from your x and y mapping, so we can see that it is using log(TENURE) for x. (See bottom for a confirmation of that.)
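As a quick check of that point outside ggplot, here is a minimal base R sketch. The data frame names d and smooth_view and the seed are my own assumptions, not from the question. It fits one model on a precomputed log(TENURE) column, which is what geom_smooth receives as x, and one with the log written directly in the formula, then compares the coefficients:

```r
set.seed(42)  # assumed seed, only for reproducibility

# Simulated data in the same shape as the example above
d <- data.frame(TENURE = runif(1e4, min = 7, max = 1000))
d$GROWTH_RATE <- rnorm(1e4, mean = 1, sd = 0.1) * log(log(d$TENURE))

# What geom_smooth(formula = y ~ x) sees when aes(x = log(TENURE)) is mapped
smooth_view <- data.frame(x = log(d$TENURE), y = d$GROWTH_RATE)
fit_xy <- lm(y ~ x, data = smooth_view)

# The same model written directly against the raw column
fit_direct <- lm(GROWTH_RATE ~ log(TENURE), data = d)

# Coefficient names differ, so compare the bare values
coefs_match <- isTRUE(all.equal(unname(coef(fit_xy)), unname(coef(fit_direct))))
print(coefs_match)
```

If the two sets of coefficients agree, the line in the first plot is the y ~ log(TENURE) fit that the question asks about.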
But modeling against the log of the log of tenure is a better fit. Here, when we use y ~ log(x), it means y ~ log( [log(TENURE)] ), since x is globally mapped in ggplot(aes(...)) to relate to the log of TENURE.
... geom_smooth(method = 'lm', formula = 'y ~ log(x)')
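To put a number on "better fit", one option is to compare R-squared for the two formulas on simulated log-of-log data. This is a rough base R sketch, not part of the original plots; the names d, fit_log and fit_loglog and the seed are assumptions for illustration:

```r
set.seed(42)  # assumed seed, only for reproducibility

# Simulated "log of log" data, as in the example above
d <- data.frame(TENURE = runif(1e4, min = 7, max = 1000))
d$GROWTH_RATE <- rnorm(1e4, mean = 1, sd = 0.1) * log(log(d$TENURE))

# x as ggplot maps it: the log of TENURE
d$x <- log(d$TENURE)

fit_log    <- lm(GROWTH_RATE ~ x, data = d)       # formula = y ~ x
fit_loglog <- lm(GROWTH_RATE ~ log(x), data = d)  # formula = y ~ log(x)

# Share of variance explained by each model
r2 <- c(log = summary(fit_log)$r.squared,
        loglog = summary(fit_loglog)$r.squared)
print(r2)
```

On data generated this way, the log(x) model should score the higher R-squared, though with this much noise the gap may be small.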
If instead the underlying relationship had been GROWTH_RATE ~ log(TENURE), as in the different generated data here, your first lm would have matched better:
tibble(TENURE = runif(1E4, min = 7, max = 1000),
       GROWTH_RATE = rnorm(1E4, mean = 1, sd = 0.1) * log(TENURE)) %>%
  ggplot(aes(log(TENURE), GROWTH_RATE)) +
  geom_point(alpha = 0.3, color = "gray50") +
  geom_smooth(method = 'lm', formula = 'y ~ x')