How to divide dataset into some blocks and choose the largest one?


In R, I want to divide n = 10000 iid observations into 100 blocks, each of size n/100 = 100. Then I want to take the largest value from each block to form a new dataset of size 100. How can I do this in R?

For example,

#sample data
n = 10000
exp_data = rexp(n, 1)

CodePudding user response:

First you need a column that provides the grouping. In this example, assume the groups are sequential (i.e. the first 100 values belong to the first group, the next 100 to the second group, and so on):

# (index - 1) %/% 100 + 1 puts values 1-100 in group 1, 101-200 in group 2, ...
df = data.frame(values = exp_data,
                group = (seq_along(exp_data) - 1) %/% 100 + 1)

Now, just use tapply to get the maximum:

with(df, tapply(X = values, 
                INDEX = group, 
                FUN = max))
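
As a cross-check on the grouping, here is a minimal self-contained sketch, assuming n = 10000 and a block size of 100. The seed and the names block_size, group, block_max and block_max_mat are illustrative and not part of the original answer. It builds the group index with integer division so that every block holds exactly 100 values, then compares the tapply result with a matrix-based alternative:

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 10000
block_size <- 100                # assumed block size (n / 100)
exp_data <- rexp(n, 1)

# Indices 1..100 map to block 1, 101..200 to block 2, and so on
group <- (seq_along(exp_data) - 1) %/% block_size + 1

# Maximum of each block via tapply
block_max <- tapply(exp_data, group, max)

# Same computation by filling a 100 x 100 matrix column by column,
# so that each column holds one block of consecutive values
block_max_mat <- apply(matrix(exp_data, nrow = block_size), 2, max)

length(block_max)                               # number of blocks
all.equal(as.vector(block_max), block_max_mat)  # both approaches agree
```

The matrix version relies on n being an exact multiple of the block size; the tapply version would also cope with a shorter final block.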

CodePudding user response:

One tidyverse way could be:

  1. First convert the vector to a tibble with as_tibble() from the tibble package.
  2. Generate groups of 100 with the gl() function.
  3. Split the tibble of 10000 rows into a list of 100 tibbles.
  4. Apply map() from the purrr package with slice_max() (dplyr package) to get the row with the maximum value from each of the 100 tibbles.
  5. Finally, use bind_rows() to combine them into a new tibble with 100 rows:

Note: dplyr, tibble, and purrr are all part of the tidyverse.

library(tidyverse)

exp_data %>% 
  as_tibble() %>% 
  mutate(group = as.integer(gl(n(), 100, n()))) %>% 
  group_split(group) %>%
  map(., ~slice_max(., order_by = value)) %>% 
  bind_rows()

   value group
   <dbl> <int>
 1  5.81     1
 2  6.42     2
 3  4.46     3
 4  4.07     4
 5  5.35     5
 6  5.85     6
 7  4.03     7
 8  5.13     8
 9  4.71     9
10  4.71    10
# … with 90 more rows
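
A shorter variant of the same idea, offered as a sketch rather than as the answer's own method: instead of splitting into a list and binding the pieces back together, group the rows and summarise each group directly. It assumes only dplyr is loaded; the seed and the names block_size and block_max are illustrative.

```r
library(dplyr)

set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 10000
block_size <- 100                # assumed block size (n / 100)
exp_data <- rexp(n, 1)

block_max <- tibble(value = exp_data) %>%
  # rows 1..100 -> group 1, rows 101..200 -> group 2, ...
  mutate(group = (row_number() - 1) %/% block_size + 1) %>%
  group_by(group) %>%
  # one row per group, holding that group's maximum
  summarise(value = max(value))

nrow(block_max)
```

Unlike slice_max(), which can return several rows for a group when the maximum is tied, summarise() with max() always yields exactly one row per group.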