Create a new column based on the sum of values from another column, with dplyr-CodePudding

I'd like to create a new variable called POPULATION that takes up the sum of the values of the variable P1 grouped by the variable CODASC. It seemed easy to me at the beginning, but I'm eventually struggling. Since I have to do this for a lot of variables and for several datasets, I really need a quick way of doing it! If anyone can help me, I would really appreciate it! Many thanks, Ilaria

My data frame looks like that:

PROCOM      SEZ2011   SEZ CODASC    P1   P47   P62  P131    E1    E3  ST15   A46
   <int>        <dbl> <int>  <int> <int> <int> <int> <int> <int> <int> <int> <int>
1  48017 480000000000 60001      4   251    25     9    20    70    40    19    20
2  48017 480000000000 60002      3    15     1     0     1     4     4     0     3
3  48017 480000000000 60003      2    20     7     2     1     1     1     1     1
4  48017 480000000000 60004      3   253    21     4    10    63    40    49    22
5  48017 480000000000 60005      5     3     0     1     0     1     1     0     2
6  48017 480000000000 60006      1   161    19     7     5    27    17    26    13
>

And my code looks like that:

df <- df %>%
  group_by(CODASC) %>%
  mutate(POPULATION = sum(P1 , na.rm= T))

CodePudding user response：

To apply sum within a group across multiple variables you could do, as an example:

library(dplyr)

df %>% 
  group_by(CODASC) %>% 
  mutate(across(P1:last_col(), sum, .names = "{.col}_sum")) %>% 
  ungroup()

To apply this across multiple data frames (if you're grouping by the same variable and summing the same columns) you can iterate through them easily if they're in a list and with the purrr library:

library(purrr)
library(dplyr)

l <- list(df, df, df)

map(l, ~ .x %>% 
      group_by(CODASC) %>% 
      mutate(across(P1:last_col(), sum, .names = "{.col}_sum")) %>% 
      ungroup())

CodePudding user response：

Your code looks good. It looks like your first 4 columns are some identifiers. If you want to summarise all remaining columns you can do something like:

df <- df %>%
  group_by(PROCOM, SEZ2011, SEZ, CODASC) %>%
  summarise_all(sum) ## or whatever function you want here

see https://dplyr.tidyverse.org/reference/summarise_all.html for more details. If you want to create a function to apply to many datasets, perhaps check out making functions: https://swcarpentry.github.io/r-novice-inflammation/02-func-R/ and apply functions