Determine maximum value in column related to specific order id-CodePudding

For data on online orders I have to determine the basket size of separate orders. The information I have is on product_number_within_order per order_id. For example for order id 123456 there are 7 variables with product_number_within_order 1 to 7 (so the basket size is the maximum product_number_within_order). For each order id the basket size should be put in the data table. How can I calculate this best? The order_id's are not in order, which makes it difficult me to program. Now I have the following, however this takes a long time to run so it does not seem optimal to me.

vec <- data.frame(table(unlist(data5$order_id)))

which gives the basket size per order (since it counts the number of times an order_id was mentioned)

    Var1 Freq
1      24409499    2
2      37018675    1
3      49812254    1
4      72349794    1
5     121649820    2
6     123680104    3
7     156423543   11

And the following code to determine the basket size

for (i in 1:length(data5$order_id)) {
  for (j in 1:length(vec$Var1))
  if (data5$order_id[i] != vec$Var1[j]) {
    size <- vec$Freq[j]
    data5$basketsize[i] <- size
  } 
}

For some order_id's it works but for others it is completely wrong. Does anyone know how to do this more efficiently or where the mistake lies?

CodePudding user response：

Assuming data5 is a data frame containing the variable order_id, you can simply:

library(dplyr)

data5 %>% 
  group_by(order_id) %>% 
  summarise(size=n())

For example, here I make a sample dataset and apply the above functions:

> data5 = tibble(order_id = sample(1:5, size=15, replace=T), other_var = 1:15)
> data5
# A tibble: 15 × 2
   order_id other_var
      <int>     <int>
 1        1         1
 2        2         2
 3        2         3
 4        4         4
 5        5         5
 6        1         6
 7        1         7
 8        4         8
 9        2         9
10        1        10
11        2        11
12        4        12
13        2        13
14        4        14
15        2        15
> data5 %>% 
  group_by(order_id) %>% 
  summarise(size=n())
# A tibble: 5 × 2
  order_id  size
     <int> <int>
1        1     4
2        2     1
3        3     1
4        4     5
5        5     4