I have a tibble with info about users (ID) that use different workstations (WS) to create documents in a certain domain (DM), documents have a certain error rate (ER), and then there is more info (as a made up example PP):
df <- tibble("ID" = c("A","B","A","B","A","B","A", "C", "B","A","B","C"),
"WS" = c("a", "b", "a", "b", "a", "b", "a", "b", "a", "b", "a", "b"),
"DM" = c("r", "r", "p", "p", "r", "r", "p", "p", "r", "r", "p", "p"),
"ER" = c(92.91307, 84.93493, 86.46488, 90.83528, 94.30230, 92.85309,
81.14628, 81.33754, 93.22315, 96.83421, 85.76104, 93.71726 ),
"PP" = c(6, 4, 6, 0, 0, 0, 2, 5, 7, 2, 7, 6))
I need to know how many docs were created by each user on each workstation (numDocs_ID_WS), and what the error rate (avg) was:
df_mutA <- df %>%
group_by(ID, WS) %>%
mutate(numDocs_ID_WS = n(),
avg = mean(ER))
df_mutA
Now where I struggle is to get the number of different workstations (numWS) each user was using. I clearly get the correct number when using summarize():
df_sumB <- df %>%
group_by(ID, WS) %>%
summarize(numDocs_ID_WS = n(),
avg = mean(ER)) %>%
mutate(numWS = length(ID))
df_sumB
But with that, I lose additional columns like PP. Using mutate() on the other hand instead of summarize() gives the wrong result; when running
df_mutB <- df %>%
group_by(ID, WS) %>%
mutate(numDocs_ID_WS = n(),
avg = mean(ER)) %>%
mutate(numWS = length(ID))
df_mutB
the results for columns numDocs_ID_WS and numWS show identical results.
I understand that one way to solve this item would be to create a summary table and use join to get the correct final result; this seems not elegant to me, especially in the light of having to deal with big tables (thousands of rows x hundreds of columns). Did I miss anything? Is there a way to create this column in just one step (I would have hoped grouping gets me closer, but I couldn't find the right way)?
Thank you very much for your ideas and help, it is very much appreciated! Best regards, Saschik
CodePudding user response:
You just need to change the grouping to do this:
df %>%
group_by(ID, WS) %>%
mutate(
numDocs_ID_WS = n(),
avg = mean(ER)) %>%
group_by(ID) %>%
mutate(
numWS = n_distinct(WS)
)
# A tibble: 12 x 8
# Groups: ID [3]
# ID WS DM ER PP numDocs_ID_WS avg numWS
# <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <int>
# 1 A a r 92.9 6 4 88.7 2
# 2 B b r 84.9 4 3 89.5 2
# 3 A a p 86.5 6 4 88.7 2
# 4 B b p 90.8 0 3 89.5 2
# 5 A a r 94.3 0 4 88.7 2
# 6 B b r 92.9 0 3 89.5 2
# 7 A a p 81.1 2 4 88.7 2
# 8 C b p 81.3 5 2 87.5 1
# 9 B a r 93.2 7 2 89.5 2
# 10 A b r 96.8 2 1 96.8 2
# 11 B a p 85.8 7 2 89.5 2
# 12 C b p 93.7 6 2 87.5 1
CodePudding user response:
Having a dedicated table for each type of data (e.g. users, workstations, summary A, summary B) is indeed a very clear way to store your results. Joining tables is a default task in almost all relational databases. The word tidy from tidyverse actually refers to tidy data, which is just a table for every type and a row for every item. This goes hand in hand with object orientated programming (a table for each class and a row for each object of this class). Joining is very fast if the id columns are sorted. Internally, a data frame is just a collection of pointers to various areas of the memory, it just appears to be one coherent thing in the console.