How to summarize with R without losing additional column information?-CodePudding

I have a tibble with info about users (ID) that use different workstations (WS) to create documents in a certain domain (DM), documents have a certain error rate (ER), and then there is more info (as a made up example PP):

df <- tibble("ID" = c("A","B","A","B","A","B","A", "C", "B","A","B","C"), 
                      "WS" = c("a", "b", "a", "b", "a", "b", "a", "b", "a", "b", "a", "b"), 
                      "DM" = c("r", "r", "p", "p", "r", "r", "p", "p", "r", "r", "p", "p"),
                      "ER" = c(92.91307, 84.93493, 86.46488, 90.83528, 94.30230, 92.85309,
                               81.14628, 81.33754, 93.22315, 96.83421, 85.76104, 93.71726 ),
                      "PP" = c(6, 4, 6, 0, 0, 0, 2, 5, 7, 2, 7, 6))

I need to know how many docs were created by each user on each workstation (numDocs_ID_WS), and what the error rate (avg) was:

df_mutA <- df %>%
  group_by(ID, WS) %>%
  mutate(numDocs_ID_WS = n(),
         avg = mean(ER))
df_mutA

Now where I struggle is to get the number of different workstations (numWS) each user was using. I clearly get the correct number when using summarize():

df_sumB <- df %>% 
  group_by(ID, WS) %>% 
  summarize(numDocs_ID_WS = n(),
            avg = mean(ER)) %>% 
  mutate(numWS = length(ID))
df_sumB

But with that, I lose additional columns like PP. Using mutate() on the other hand instead of summarize() gives the wrong result; when running

df_mutB <- df %>%
  group_by(ID, WS) %>%
  mutate(numDocs_ID_WS = n(),
         avg = mean(ER)) %>% 
  mutate(numWS = length(ID))
df_mutB

the results for columns numDocs_ID_WS and numWS show identical results.

I understand that one way to solve this item would be to create a summary table and use join to get the correct final result; this seems not elegant to me, especially in the light of having to deal with big tables (thousands of rows x hundreds of columns). Did I miss anything? Is there a way to create this column in just one step (I would have hoped grouping gets me closer, but I couldn't find the right way)?

Thank you very much for your ideas and help, it is very much appreciated! Best regards, Saschik

CodePudding user response：

You just need to change the grouping to do this:

df %>%
  group_by(ID, WS) %>%
  mutate(
    numDocs_ID_WS = n(),
         avg = mean(ER))   %>% 
    group_by(ID)   %>% 
    mutate(
        numWS = n_distinct(WS)
    )

# A tibble: 12 x 8
# Groups:   ID [3]
#    ID    WS    DM       ER    PP numDocs_ID_WS   avg numWS
#    <chr> <chr> <chr> <dbl> <dbl>         <int> <dbl> <int>
#  1 A     a     r      92.9     6             4  88.7     2
#  2 B     b     r      84.9     4             3  89.5     2
#  3 A     a     p      86.5     6             4  88.7     2
#  4 B     b     p      90.8     0             3  89.5     2
#  5 A     a     r      94.3     0             4  88.7     2
#  6 B     b     r      92.9     0             3  89.5     2
#  7 A     a     p      81.1     2             4  88.7     2
#  8 C     b     p      81.3     5             2  87.5     1
#  9 B     a     r      93.2     7             2  89.5     2
# 10 A     b     r      96.8     2             1  96.8     2
# 11 B     a     p      85.8     7             2  89.5     2
# 12 C     b     p      93.7     6             2  87.5     1

CodePudding user response：

Having a dedicated table for each type of data (e.g. users, workstations, summary A, summary B) is indeed a very clear way to store your results. Joining tables is a default task in almost all relational databases. The word tidy from tidyverse actually refers to tidy data, which is just a table for every type and a row for every item. This goes hand in hand with object orientated programming (a table for each class and a row for each object of this class). Joining is very fast if the id columns are sorted. Internally, a data frame is just a collection of pointers to various areas of the memory, it just appears to be one coherent thing in the console.