Home > Software engineering >  How to summarize with R without losing additional column information?
How to summarize with R without losing additional column information?

Time:06-29

I have a tibble with info about users (ID) that use different workstations (WS) to create documents in a certain domain (DM), documents have a certain error rate (ER), and then there is more info (as a made up example PP):

df <- tibble("ID" = c("A","B","A","B","A","B","A", "C", "B","A","B","C"), 
                      "WS" = c("a", "b", "a", "b", "a", "b", "a", "b", "a", "b", "a", "b"), 
                      "DM" = c("r", "r", "p", "p", "r", "r", "p", "p", "r", "r", "p", "p"),
                      "ER" = c(92.91307, 84.93493, 86.46488, 90.83528, 94.30230, 92.85309,
                               81.14628, 81.33754, 93.22315, 96.83421, 85.76104, 93.71726 ),
                      "PP" = c(6, 4, 6, 0, 0, 0, 2, 5, 7, 2, 7, 6))

I need to know how many docs were created by each user on each workstation (numDocs_ID_WS), and what the error rate (avg) was:

df_mutA <- df %>%
  group_by(ID, WS) %>%
  mutate(numDocs_ID_WS = n(),
         avg = mean(ER))
df_mutA

Now where I struggle is to get the number of different workstations (numWS) each user was using. I clearly get the correct number when using summarize():

df_sumB <- df %>% 
  group_by(ID, WS) %>% 
  summarize(numDocs_ID_WS = n(),
            avg = mean(ER)) %>% 
  mutate(numWS = length(ID))
df_sumB

But with that, I lose additional columns like PP. Using mutate() on the other hand instead of summarize() gives the wrong result; when running

df_mutB <- df %>%
  group_by(ID, WS) %>%
  mutate(numDocs_ID_WS = n(),
         avg = mean(ER)) %>% 
  mutate(numWS = length(ID))
df_mutB

the results for columns numDocs_ID_WS and numWS show identical results.

I understand that one way to solve this item would be to create a summary table and use join to get the correct final result; this seems not elegant to me, especially in the light of having to deal with big tables (thousands of rows x hundreds of columns). Did I miss anything? Is there a way to create this column in just one step (I would have hoped grouping gets me closer, but I couldn't find the right way)?

Thank you very much for your ideas and help, it is very much appreciated! Best regards, Saschik

CodePudding user response:

You just need to change the grouping to do this:

df %>%
  group_by(ID, WS) %>%
  mutate(
    numDocs_ID_WS = n(),
         avg = mean(ER))   %>% 
    group_by(ID)   %>% 
    mutate(
        numWS = n_distinct(WS)
    )

# A tibble: 12 x 8
# Groups:   ID [3]
#    ID    WS    DM       ER    PP numDocs_ID_WS   avg numWS
#    <chr> <chr> <chr> <dbl> <dbl>         <int> <dbl> <int>
#  1 A     a     r      92.9     6             4  88.7     2
#  2 B     b     r      84.9     4             3  89.5     2
#  3 A     a     p      86.5     6             4  88.7     2
#  4 B     b     p      90.8     0             3  89.5     2
#  5 A     a     r      94.3     0             4  88.7     2
#  6 B     b     r      92.9     0             3  89.5     2
#  7 A     a     p      81.1     2             4  88.7     2
#  8 C     b     p      81.3     5             2  87.5     1
#  9 B     a     r      93.2     7             2  89.5     2
# 10 A     b     r      96.8     2             1  96.8     2
# 11 B     a     p      85.8     7             2  89.5     2
# 12 C     b     p      93.7     6             2  87.5     1

CodePudding user response:

Having a dedicated table for each type of data (e.g. users, workstations, summary A, summary B) is indeed a very clear way to store your results. Joining tables is a default task in almost all relational databases. The word tidy from tidyverse actually refers to tidy data, which is just a table for every type and a row for every item. This goes hand in hand with object orientated programming (a table for each class and a row for each object of this class). Joining is very fast if the id columns are sorted. Internally, a data frame is just a collection of pointers to various areas of the memory, it just appears to be one coherent thing in the console.

  •  Tags:  
  • r
  • Related