Home > OS >  Subsetting dataframe in grouped data
Subsetting dataframe in grouped data

Time:02-14

I have a dataframe including a column of factors that I would like to subset to select every nth row, after grouping by factor level. For example,

my_df <- data.frame(col1 = c(1:12), col2 = rep(c("A","B", "C"), 4))
my_df
  col1 col2
1     1    A
2     2    B
3     3    C
4     4    A
5     5    B
6     6    C
7     7    A
8     8    B
9     9    C
10   10    A
11   11    B
12   12    C

Subsetting to select every 2nd row should yield my_new_df as,

  col1 col2
1    4    A
2   10    A
3    5    B
4   11    B
5    6    C
6   12    C

I tried in dplyr:

my_df %>% group_by(col2) %>%
my_df[seq(2, nrow(my_df), 2), ] -> my_new_df

I get an error:

Error: Can't subset columns that don't exist.
x Locations 4, 6, 8, 10, and 12 don't exist.
ℹ There are only 2 columns.

To see if the nrow function was a problem, I tried using the number directly. So,

my_df %>% group_by(col2) %>%
   my_df[seq(2, 4, 2), ] -> my_new_df

Also gave an error,

Error: Can't subset columns that don't exist.
x Location 4 doesn't exist.
ℹ There are only 2 columns.
Run `rlang::last_error()` to see where the error occurred.

My expectation was that it would run the subsetting on each group of data and then combine them into 'my_new_df'. My understanding of how group_by works is clearly wrong but I am stuck on how to move past this error. Any help would much appreciated.

CodePudding user response:

Try:

my_df %>%
  group_by(col2)%>%
  slice(seq(from = 2, to = n(), by = 2))

# A tibble: 6 x 2
# Groups:   col2 [3]
   col1 col2 
  <int> <chr>
1     4 A    
2    10 A    
3     5 B    
4    11 B    
5     6 C    
6    12 C   

You might want to ungroup after slicing if you want to do other operations not based on col2.

CodePudding user response:

Here is a data.table option:

library(data.table)
data <- as.data.table(my_df)

data[(rowid(col2) %% 2) == 0]

   col1 col2
1:    4    A
2:    5    B
3:    6    C
4:   10    A
5:   11    B
6:   12    C

Or base R:

my_df[as.logical(with(my_df, ave(col1, col2, FUN = function(x) 
  seq_along(x) %% 2 == 0))), ]

   col1 col2
4     4    A
5     5    B
6     6    C
10   10    A
11   11    B
12   12    C
  • Related