Convert dplyr Chain to "for loop"

Time:12-26

I am trying to calculate the correlation between price and depth in the diamonds data set, grouped by color.

I have constructed a pipeline of dplyr commands that returns what I want:

library(dplyr)
library(ggplot2)
data(diamonds)

df <- data.frame(group = diamonds$color, a = diamonds$price , b = diamonds$depth )

df %>% group_by(group) %>% summarize(Corr = cor(a,b)) %>% as.data.frame()

This outputs what I want:


  group         Corr
1     D  0.013415309
2     E  0.017037228
3     F  0.079294072
4     G -0.032000363
5     H  0.051953865
6     I  0.009288322
7     J  0.086863041

But I would like to create a for loop that serves the exact same purpose.

I understand the basics of a for loop, but to me it seems difficult to figure out the logic around such a task. Dplyr makes a lot of sense as it is.

Any ideas? Thank you (and happy holidays!)

CodePudding user response:

I think the first thing to say is there's no particular reason to do this in a for loop - often avoiding loops is preferable in R, and here the dplyr syntax is designed for this kind of operation.

However, if you want to use a loop to get more familiar with them, I would be inclined to split() the data frame into a list of data frames, each of which contains one color.

# Produces a list of length 7
diamonds_color_list  <- split(diamonds, ~color)

diamonds_color_list[[1]]
# # A tibble: 6,775 x 10
#    carat cut       color clarity depth table price     x     y     z
#    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#  1  0.23 Very Good D     VS2      60.5    61   357  3.96  3.97  2.4 
#  2  0.23 Very Good D     VS1      61.9    58   402  3.92  3.96  2.44
#  3  0.26 Very Good D     VS2      60.8    59   403  4.13  4.16  2.52
#  4  0.26 Good      D     VS2      65.2    56   403  3.99  4.02  2.61
#  5  0.26 Good      D     VS1      58.4    63   403  4.19  4.24  2.46
#  6  0.22 Premium   D     VS2      59.3    62   404  3.91  3.88  2.31
#  7  0.3  Premium   D     SI1      62.6    59   552  4.23  4.27  2.66
#  8  0.3  Ideal     D     SI1      62.5    57   552  4.29  4.32  2.69
#  9  0.3  Ideal     D     SI1      62.1    56   552  4.3   4.33  2.68
# 10  0.24 Very Good D     VVS1     61.5    60   553  3.97  4     2.45
# # ... with 6,765 more rows

To calculate the correlation of price and depth for the first color, you could run cor(diamonds_color_list[[1]]$price, diamonds_color_list[[1]]$depth).

It might seem tempting to iterate through this list by index, starting with for(i in 1:7) cor(diamonds_color_list[[i]]$price, diamonds_color_list[[i]]$depth). This is OK, though for(i in seq_along(diamonds_color_list)) would be better because it does not hard-code the number of groups.
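As a minimal sketch of the index-based version: the data frame toy below is a made-up stand-in with the same column names as diamonds (an assumption for illustration, so the snippet runs without ggplot2). Note that a bare cor() call inside a for loop prints nothing, so each result has to be stored somewhere.

```r
# Toy stand-in for diamonds (hypothetical values, same column names)
toy <- data.frame(
    color = c("D", "D", "D", "E", "E", "E"),
    price = c(100, 200, 300, 100, 200, 300),
    depth = c(60, 61, 62, 63, 62, 61)
)

# One data frame per color
toy_list <- split(toy, toy$color)

# Preallocate a numeric vector with one slot per group
cors <- numeric(length(toy_list))
names(cors) <- names(toy_list)

# seq_along() adapts to the number of groups instead of hard-coding 1:7
for (i in seq_along(toy_list)) {
    cors[i] <- cor(toy_list[[i]]$price, toy_list[[i]]$depth)
}

# Named vector: D is 1 and E is -1 (up to floating-point rounding)
cors
```

Replacing toy with diamonds should give one correlation per color in the same way.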

However, I think code is more readable if you iterate over the color names, so I would do something along the lines of:

colors  <- names(diamonds_color_list) # "D" "E" "F" "G" "H" "I" "J"

# Create empty list of correct length so you don't
# commit the sin of growing an object in a loop
results_list  <- vector(mode="list", length = length(diamonds_color_list))  |>
    setNames(colors)

# Loop through each color in the data frame 
# calculate the cor and add to the list
for (color in colors) {
    result <- list(
        color = color,
        cor = cor(
            diamonds_color_list[[color]]$price,
            diamonds_color_list[[color]]$depth
        )
    )
    results_list[[color]]  <- result
}

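# Bind them together row-wise. Because each result is a list, this returns
# a matrix of list elements (hence the quoted "D"), not a true data frame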
do.call(rbind, results_list)
#   color cor
# D "D"   -0.01352522
# E "E"   -0.005518622
# F "F"   0.006164055
# G "G"   -0.007661284
# H "H"   -0.02033827
# I "I"   -0.08299649
# J "J"   -0.04973188

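All these functions are base R and do not require dplyr (the diamonds data itself comes from ggplot2, and the formula form of split() and the |> pipe need R 4.1 or later).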
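If you want a real data frame with the same shape as the dplyr output, one option is to store a one-row data frame per color instead of a list. This is only a sketch: toy is a made-up stand-in for diamonds with the same column names, and results_df is a name chosen here for illustration.

```r
# Toy stand-in for diamonds (hypothetical values, same column names)
toy <- data.frame(
    color = c("D", "D", "D", "E", "E", "E"),
    price = c(100, 200, 300, 100, 200, 300),
    depth = c(60, 61, 62, 63, 62, 61)
)

toy_list <- split(toy, toy$color)
colors <- names(toy_list)

# Preallocated list, one slot per color
results_list <- setNames(vector(mode = "list", length = length(colors)), colors)

# Store a one-row data frame for each color
for (color in colors) {
    results_list[[color]] <- data.frame(
        group = color,
        Corr = cor(toy_list[[color]]$price, toy_list[[color]]$depth)
    )
}

# rbind() on data frames returns a data frame; drop the row names
results_df <- do.call(rbind, results_list)
rownames(results_df) <- NULL
results_df
```

Here results_df has a group column and a numeric Corr column, so it can be compared directly with the summarize() result.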
