I have a df in which each entry has an assigned grouped ID, like:

id_person income id_family
001       100    A1
002       120    B2
003       100    B5
004       220    A1
005       300    A2

And now I need to assign a new variable for each id_person, summing up the variable income of all entries with the same id_family. Such as:

id_person income id_family family_income
001       100    A1        220
002       120    B2        120
003       100    B5        400
004       220    A1        230
005       300    B5        400

I do not wish to create a new grouped_df, neither do I need to see this variable summarized for now.

How can I give this command using R?

CodePudding user response：

You may try

Data

df <- read.table(text = "id_person income id_family
001       100    A1
002       120    B2
003       100    B5
004       220    A1
005       300    B5", header = T)

  id_person income id_family
1         1    100        A1
2         2    120        B2
3         3    100        B5
4         4    220        A1
5         5    300        B5

Code

library(dplyr)

df %>%
  group_by(id_family) %>%
  mutate(family_income = sum(income)) %>%
  ungroup
    
  id_person income id_family family_income
      <int>  <int> <chr>             <int>
1         1    100 A1                  320
2         2    120 B2                  120
3         3    100 B5                  400
4         4    220 A1                  320
5         5    300 B5                  400

CodePudding user response：

df %>% 
  group_by(id_family) %>% 
  summarise(Total = sum(income, na.rm = TRUE))

or the across() function could be used:

df %>%
  group_by(id_family) %>%
  summarise(across(income, sum))

CodePudding user response：

Using data in a data.table, we can perform operations on variables by a grouping variable (in by=), then assign that back to the data using the data.table assignment operator :=

library(data.table)
setDT(df)
df[, "family_income" := sum(income), by = id_family]

The data.table data structure is a pumped up version of the R data.frame, giving added functionality and efficiency gains. If DT is your data.table, DT[i, j, by] is the notation showing how we can use i to sort or subset data, j for selecting or computing on variables, and by to perfrom j-operations on groups. For example, for cars with over 100 horsepower, what is the mean fuel efficiency for automatic (0) and manual (1) cars?

dtcars <- data.table(mtcars)
dtcars[hp>100, mean(mpg), by=am]

Returns:

> dtcars[hp>100, mean(mpg), by=am]
   am       V1
1:  1 20.61429
2:  0 16.06875