For data on online orders I have to determine the basket size of separate orders. The information I have is on product_number_within_order per order_id. For example for order id 123456 there are 7 variables with product_number_within_order 1 to 7 (so the basket size is the maximum product_number_within_order). For each order id the basket size should be put in the data table. How can I calculate this best? The order_id's are not in order, which makes it difficult me to program. Now I have the following, however this takes a long time to run so it does not seem optimal to me.
vec <- data.frame(table(unlist(data5$order_id)))
which gives the basket size per order (since it counts the number of times an order_id was mentioned)
Var1 Freq
1 24409499 2
2 37018675 1
3 49812254 1
4 72349794 1
5 121649820 2
6 123680104 3
7 156423543 11
And the following code to determine the basket size
for (i in 1:length(data5$order_id)) {
for (j in 1:length(vec$Var1))
if (data5$order_id[i] != vec$Var1[j]) {
size <- vec$Freq[j]
data5$basketsize[i] <- size
}
}
For some order_id's it works but for others it is completely wrong. Does anyone know how to do this more efficiently or where the mistake lies?
CodePudding user response:
Assuming data5
is a data frame containing the variable order_id
, you can simply:
library(dplyr)
data5 %>%
group_by(order_id) %>%
summarise(size=n())
For example, here I make a sample dataset and apply the above functions:
> data5 = tibble(order_id = sample(1:5, size=15, replace=T), other_var = 1:15)
> data5
# A tibble: 15 × 2
order_id other_var
<int> <int>
1 1 1
2 2 2
3 2 3
4 4 4
5 5 5
6 1 6
7 1 7
8 4 8
9 2 9
10 1 10
11 2 11
12 4 12
13 2 13
14 4 14
15 2 15
> data5 %>%
group_by(order_id) %>%
summarise(size=n())
# A tibble: 5 × 2
order_id size
<int> <int>
1 1 4
2 2 1
3 3 1
4 4 5
5 5 4