I am trying to calculate the correlation between price and depth in the diamonds data set, grouped by color.
I have constructed a pipeline of dplyr commands that returns what I want:
library(dplyr)
library(ggplot2)
data(diamonds)
df <- data.frame(group = diamonds$color, a = diamonds$price , b = diamonds$depth )
df %>% group_by(group) %>% summarize(Corr = cor(a,b)) %>% as.data.frame()
This outputs what I want:
group Corr
1 D 0.013415309
2 E 0.017037228
3 F 0.079294072
4 G -0.032000363
5 H 0.051953865
6 I 0.009288322
7 J 0.086863041
But I would like to create a for loop that serves the exact same purpose.
I understand the basics of a for loop, but to me it seems difficult to figure out the logic around such a task. Dplyr makes a lot of sense as it is.
Any ideas? Thank you (and happy holidays!)
CodePudding user response:
I think the first thing to say is that there's no particular reason to do this in a for loop. Avoiding loops is often preferable in R, and the dplyr syntax is designed for this kind of operation.
However, if you want to use a loop to get more familiar with them, I would be inclined to split() the data frame into a list of data frames, each of which contains one color.
# Produces a list of length 7
diamonds_color_list <- split(diamonds, ~color)
diamonds_color_list[[1]]
# # A tibble: 6,775 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Very Good D VS2 60.5 61 357 3.96 3.97 2.4
# 2 0.23 Very Good D VS1 61.9 58 402 3.92 3.96 2.44
# 3 0.26 Very Good D VS2 60.8 59 403 4.13 4.16 2.52
# 4 0.26 Good D VS2 65.2 56 403 3.99 4.02 2.61
# 5 0.26 Good D VS1 58.4 63 403 4.19 4.24 2.46
# 6 0.22 Premium D VS2 59.3 62 404 3.91 3.88 2.31
# 7 0.3 Premium D SI1 62.6 59 552 4.23 4.27 2.66
# 8 0.3 Ideal D SI1 62.5 57 552 4.29 4.32 2.69
# 9 0.3 Ideal D SI1 62.1 56 552 4.3 4.33 2.68
# 10 0.24 Very Good D VVS1 61.5 60 553 3.97 4 2.45
# # ... with 6,765 more rows
To calculate the correlation of price and depth for the first color, you could do cor(diamonds_color_list[[1]]$price, diamonds_color_list[[1]]$depth).
It might seem tempting to iterate through this list by index, starting with for(i in 1:7) cor(diamonds_color_list[[i]]$price, diamonds_color_list[[i]]$depth). This is OK, though for(i in seq_along(diamonds_color_list)) would be better, because it does not hard-code the number of groups.
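For reference, the index-based version could be sketched roughly as follows. The names cors and sub are only illustrative and not part of the original code, and the results are stored in a preallocated numeric vector rather than just computed and discarded.

```r
library(ggplot2)  # only needed for the diamonds data set

# One data frame per color
diamonds_color_list <- split(diamonds, diamonds$color)

# Preallocate a numeric vector with one slot per color
cors <- numeric(length(diamonds_color_list))
names(cors) <- names(diamonds_color_list)

# seq_along() gives 1, 2, ..., length(list), and an empty
# sequence (no iterations) if the list happens to be empty
for (i in seq_along(diamonds_color_list)) {
  sub <- diamonds_color_list[[i]]
  cors[i] <- cor(sub$price, sub$depth)
}

cors
```

Unlike 1:length(x), which counts down from 1 to 0 when x is empty, seq_along(x) simply skips the loop body in that case.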
However, I think code is more readable if you iterate over the color names, so I would do something along the lines of:
colors <- names(diamonds_color_list) # "D" "E" "F" "G" "H" "I" "J"

# Create empty list of correct length so you don't
# commit the sin of growing an object in a loop
results_list <- vector(mode = "list", length = length(diamonds_color_list)) |>
    setNames(colors)

# Loop through each color in the data frame,
# calculate the cor and add to the list
for (color in colors) {
    result <- list(
        color = color,
        cor = cor(
            diamonds_color_list[[color]]$price,
            diamonds_color_list[[color]]$depth
        )
    )
    results_list[[color]] <- result
}

# Bind them together, one row per color
# (rbind() on lists returns a list-matrix, not a true data frame)
do.call(rbind, results_list)
# color cor
# D "D" -0.01352522
# E "E" -0.005518622
# F "F" 0.006164055
# G "G" -0.007661284
# H "H" -0.02033827
# I "I" -0.08299649
# J "J" -0.04973188
All these functions are base R and do not require dplyr (the diamonds data set itself comes from ggplot2).
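If you would rather end up with an ordinary data frame with a numeric cor column, one possible variation is to store the correlations in a preallocated named numeric vector and build the data frame at the end. This is only a sketch; cors, sub and results_df are illustrative names, not part of the code above.

```r
library(ggplot2)  # only needed for the diamonds data set

diamonds_color_list <- split(diamonds, diamonds$color)
colors <- names(diamonds_color_list)

# Preallocated named numeric vector, one element per color
cors <- numeric(length(colors))
names(cors) <- colors

for (color in colors) {
  sub <- diamonds_color_list[[color]]
  cors[[color]] <- cor(sub$price, sub$depth)
}

# Assemble a regular data frame: character color column, numeric cor column
results_df <- data.frame(color = colors, cor = unname(cors))
results_df
```

Because cor is stored as a plain numeric column here, the result can be sorted, filtered or plotted without first unpacking list elements.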