How to group_by 4 or more variables using dplyr

I need to group my dataset by 4 variables: a, b, c, and d.

Data frame example

Desired result

I want to group by the 4 columns with group_by(col1, col2, col3, col4), but it doesn't work: it only takes the first 3 columns.

CodePudding user response：

My guess, since you shared no code: "it only takes the first 3 columns" suggests that, with four arguments, the first one is being interpreted as the data frame. group_by() expects the data as its first argument, so calling group_by(col1, col2, col3, col4) with nothing piped into it treats col1 as the data. If you had

mydata # with or without this
group_by(col1, col2, col3, col4) %>%
  ...

then change it to either of these, which both pass mydata as the first argument:

mydata %>%
  group_by(col1, col2, col3, col4) %>%
  ...

group_by(mydata, col1, col2, col3, col4) %>%
  ...
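As a quick check that the two forms are equivalent, here is a minimal sketch on the built-in mtcars data, since the original data isn't shown. The choice of columns (cyl, vs, am, gear) is an assumption for illustration only.

```r
library(dplyr)

# Piped form: mtcars becomes the first argument of group_by()
piped <- mtcars %>%
  group_by(cyl, vs, am, gear)

# Direct form: the data frame is passed explicitly as the first argument
direct <- group_by(mtcars, cyl, vs, am, gear)

# Both record the same four grouping variables
group_vars(piped)
group_vars(direct)
identical(group_vars(piped), group_vars(direct))
```

If all four column names come back from group_vars(), the fourth column is not being dropped.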

CodePudding user response：

group_by on its own makes no visible change to the rows of the data frame; it only records the grouping. You need to do something more after group_by.

Here is an example with the mtcars dataset:

library(dplyr)

df <- mtcars[1:10, c(1, 2, 8, 9)]
df

#                   mpg cyl vs am
#Mazda RX4         21.0   6  0  1
#Mazda RX4 Wag     21.0   6  0  1
#Datsun 710        22.8   4  1  1
#Hornet 4 Drive    21.4   6  1  0
#Hornet Sportabout 18.7   8  0  0
#Valiant           18.1   6  1  0
#Duster 360        14.3   8  0  0
#Merc 240D         24.4   4  1  0
#Merc 230          22.8   4  1  0
#Merc 280          19.2   6  1  0

Using only group_by:

df %>% group_by(cyl, vs, am)

# A tibble: 10 × 4
# Groups:   cyl, vs, am [5]
#     mpg   cyl    vs    am
#   <dbl> <dbl> <dbl> <dbl>
# 1  21       6     0     1
# 2  21       6     0     1
# 3  22.8     4     1     1
# 4  21.4     6     1     0
# 5  18.7     8     0     0
# 6  18.1     6     1     0
# 7  14.3     8     0     0
# 8  24.4     4     1     0
# 9  22.8     4     1     0
#10  19.2     6     1     0
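The grouping is stored as metadata rather than shown in the rows. As a small sketch that rebuilds the same df, dplyr's group_vars() and n_groups() can be used to inspect it (the variable name grouped is just for illustration):

```r
library(dplyr)

df <- mtcars[1:10, c(1, 2, 8, 9)]
grouped <- df %>% group_by(cyl, vs, am)

group_vars(grouped)     # names of the grouping columns
n_groups(grouped)       # number of distinct cyl/vs/am combinations
is_grouped_df(grouped)  # TRUE once a grouping has been recorded
```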

You need to "tell" dplyr what to do after group_by. For example, you can sum the mpg values within each group:

df %>% group_by(cyl, vs, am) %>% summarise(sum_mpg = sum(mpg), .groups = 'drop')

#    cyl    vs    am sum_mpg
#  <dbl> <dbl> <dbl>   <dbl>
#1     4     1     0    47.2
#2     4     1     1    22.8
#3     6     0     1    42  
#4     6     1     0    58.7
#5     8     0     0    33
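The same pattern extends to four (or more) grouping columns. Here is a sketch on the first ten rows of mtcars with gear added as a fourth grouping column; gear is an assumption chosen for illustration, not something from the question's data.

```r
library(dplyr)

# Same ten rows as before, plus gear as a fourth candidate grouping column
df4 <- mtcars[1:10, c("mpg", "cyl", "vs", "am", "gear")]

result <- df4 %>%
  group_by(cyl, vs, am, gear) %>%
  summarise(sum_mpg = sum(mpg), .groups = "drop")

# All four grouping columns appear in the summary, followed by sum_mpg
names(result)
result
```

group_by accepts any number of columns after the data argument, so nothing special is needed beyond listing them.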