Home > other >  Create a new variable in data frame that contains the sum of the values of all other groups
Create a new variable in data frame that contains the sum of the values of all other groups

Time:12-09

I have data similar to this

example_data <- data.frame(
  company   = c(rep("A",6),
                rep("B",6),
                rep("C",6)),
  year      = c(rep(c(rep(c(2019), 3), rep(2020, 3)), 3)),
  country   = c(rep(c("Australia","Tanzania","Nepal"),3)),
  sales     = c(sample(1000:2000, 18)),
  employees = c(sample(100:200, 18)),
  profit    = c(sample(500:1000, 18))
)

which when printed out looks like this:

> example_data
   company year   country sales employees profit
1        A 2019 Australia  1815       138    986
2        A 2019  Tanzania  1183       126    907
3        A 2019     Nepal  1159       155    939
4        A 2020 Australia  1873       183    866
5        A 2020  Tanzania  1858       198    579
6        A 2020     Nepal  1841       184    601
7        B 2019 Australia  1989       160    595
8        B 2019  Tanzania  1162       151    520
9        B 2019     Nepal  1470       187    670
10       B 2020 Australia  1013       128    945
11       B 2020  Tanzania  1718       123    886
12       B 2020     Nepal  1135       149    778
13       C 2019 Australia  1846       188    755
14       C 2019  Tanzania  1445       194    916
15       C 2019     Nepal  1029       145    903
16       C 2020 Australia  1737       161    578
17       C 2020  Tanzania  1489       141    859
18       C 2020     Nepal  1350       167    536

The unit of observation for the three variables of interest sales, employees, profit is a unique combination of company, year, and country.

What I need is a column in the data frame for every one of these three variables named other_sales, other_employees, and other_profit. (In my actual data, I have not only three but closer to 40 such variables of interest.) These should be the sum of the other companies in that year, in that country for that variable. So for instance, example_data$other_sales[1] should be the sum of the two values 1989 and 1846, which are "he sales for company B in that year in that country, and the sales for company C in that year in that country respectively.

I am familiar with dplyr::group_by() and dplyr::mutate(), but I struggle to come up with a way to solve this problem. What I would like to do is something like this:

library(dplyr)
example_data %>%
 group_by(company, year, country) %>%
 mutate(other_sales = sum(
   example_data %>% filter(company!="this") %>% .$sales)
 )
# "this" should be the value of 'company' in the current group

Obviously, this code doesn't work. Even if it did, it would not accomplish the goal of creating these other_* variables automatically for every specified column in the data frame. I've thought about creating a complicated for loop, but I figured before I go down that most likely wrong route, it's better to ask here. Finally, while it would be possible to construct a solution based purely on column indices (i.e., for example_data[1,7] calculate the sum of [7,4] and [13,4]), this would not work in my real data because the number of observations per company can differ.

EDIT: small correction in the code

CodePudding user response:

I think this will solve your problem

example_data %>% 
  group_by(country,year) %>% 
  mutate(other_sales = sum(sales)- sales)

CodePudding user response:

To generalise it for all variables, i.e. sales, profit and employees:

library(tidyverse)

set.seed(123)

example_data <- data.frame(
  company   = c(rep("A",6),
                rep("B",6),
                rep("C",6)),
  year      = c(rep(c(rep(c(2019), 3), rep(2020, 3)), 3)),
  country   = c(rep(c("Australia","Tanzania","Nepal"),3)),
  sales     = c(sample(1000:2000, 18)),
  employees = c(sample(100:200, 18)),
  profit    = c(sample(500:1000, 18))
)

example_data |> 
  arrange(country, year, company) |> # Optional 
  group_by(country, year) |> 
  mutate(across(sales:profit, ~sum(.) - ., .names = "other_{.col}"))
#> # A tibble: 18 × 9
#> # Groups:   country, year [6]
#>    company  year country   sales employees profit other_sales other_em…¹ other…²
#>    <chr>   <dbl> <chr>     <int>     <int>  <int>       <int>      <int>   <int>
#>  1 A        2019 Australia  1414       190    989        3190        302    1515
#>  2 B        2019 Australia  1817       125    522        2787        367    1982
#>  3 C        2019 Australia  1373       177    993        3231        315    1511
#>  4 A        2020 Australia  1525       108    892        2830        372    1524
#>  5 B        2020 Australia  1228       197    808        3127        283    1608
#>  6 C        2020 Australia  1602       175    716        2753        305    1700
#>  7 A        2019 Nepal      1178       191    762        2899        283    1608
#>  8 B        2019 Nepal      1298       141    943        2779        333    1427
#>  9 C        2019 Nepal      1601       142    665        2476        332    1705
#> 10 A        2020 Nepal      1937       171    829        2721        266    1967
#> 11 B        2020 Nepal      1013       135    991        3645        302    1805
#> 12 C        2020 Nepal      1708       131    976        2950        306    1820
#> 13 A        2019 Tanzania   1462       156    608        2781        286    1633
#> 14 B        2019 Tanzania   1117       106    910        3126        336    1331
#> 15 C        2019 Tanzania   1664       180    723        2579        262    1518
#> 16 A        2020 Tanzania   1194       192    924        3010        296    1423
#> 17 B        2020 Tanzania   1243       182    634        2961        306    1713
#> 18 C        2020 Tanzania   1767       114    789        2437        374    1558
#> # … with abbreviated variable names ¹​other_employees, ²​other_profit

Created on 2022-12-08 with reprex v2.0.2

  • Related