I have data similar to this
example_data <- data.frame(
company = c(rep("A",6),
rep("B",6),
rep("C",6)),
year = c(rep(c(rep(c(2019), 3), rep(2020, 3)), 3)),
country = c(rep(c("Australia","Tanzania","Nepal"),3)),
sales = c(sample(1000:2000, 18)),
employees = c(sample(100:200, 18)),
profit = c(sample(500:1000, 18))
)
which when printed out looks like this:
> example_data
company year country sales employees profit
1 A 2019 Australia 1815 138 986
2 A 2019 Tanzania 1183 126 907
3 A 2019 Nepal 1159 155 939
4 A 2020 Australia 1873 183 866
5 A 2020 Tanzania 1858 198 579
6 A 2020 Nepal 1841 184 601
7 B 2019 Australia 1989 160 595
8 B 2019 Tanzania 1162 151 520
9 B 2019 Nepal 1470 187 670
10 B 2020 Australia 1013 128 945
11 B 2020 Tanzania 1718 123 886
12 B 2020 Nepal 1135 149 778
13 C 2019 Australia 1846 188 755
14 C 2019 Tanzania 1445 194 916
15 C 2019 Nepal 1029 145 903
16 C 2020 Australia 1737 161 578
17 C 2020 Tanzania 1489 141 859
18 C 2020 Nepal 1350 167 536
The unit of observation for the three variables of interest sales
, employees
, profit
is a unique combination of company
, year
, and country
.
What I need is a column in the data frame for every one of these three variables named other_sales
, other_employees
, and other_profit
. (In my actual data, I have not only three but closer to 40 such variables of interest.) These should be the sum of the other companies in that year, in that country for that variable. So for instance, example_data$other_sales[1]
should be the sum of the two values 1989 and 1846, which are "he sales
for company B
in that year
in that country
, and the sales
for company C
in that year
in that country
respectively.
I am familiar with dplyr::group_by()
and dplyr::mutate()
, but I struggle to come up with a way to solve this problem. What I would like to do is something like this:
library(dplyr)
example_data %>%
group_by(company, year, country) %>%
mutate(other_sales = sum(
example_data %>% filter(company!="this") %>% .$sales)
)
# "this" should be the value of 'company' in the current group
Obviously, this code doesn't work. Even if it did, it would not accomplish the goal of creating these other_*
variables automatically for every specified column in the data frame. I've thought about creating a complicated for loop, but I figured before I go down that most likely wrong route, it's better to ask here. Finally, while it would be possible to construct a solution based purely on column indices (i.e., for example_data[1,7] calculate the sum of [7,4] and [13,4]), this would not work in my real data because the number of observations per company can differ.
EDIT: small correction in the code
CodePudding user response:
I think this will solve your problem
example_data %>%
group_by(country,year) %>%
mutate(other_sales = sum(sales)- sales)
CodePudding user response:
To generalise it for all variables, i.e. sales, profit and employees:
library(tidyverse)
set.seed(123)
example_data <- data.frame(
company = c(rep("A",6),
rep("B",6),
rep("C",6)),
year = c(rep(c(rep(c(2019), 3), rep(2020, 3)), 3)),
country = c(rep(c("Australia","Tanzania","Nepal"),3)),
sales = c(sample(1000:2000, 18)),
employees = c(sample(100:200, 18)),
profit = c(sample(500:1000, 18))
)
example_data |>
arrange(country, year, company) |> # Optional
group_by(country, year) |>
mutate(across(sales:profit, ~sum(.) - ., .names = "other_{.col}"))
#> # A tibble: 18 × 9
#> # Groups: country, year [6]
#> company year country sales employees profit other_sales other_em…¹ other…²
#> <chr> <dbl> <chr> <int> <int> <int> <int> <int> <int>
#> 1 A 2019 Australia 1414 190 989 3190 302 1515
#> 2 B 2019 Australia 1817 125 522 2787 367 1982
#> 3 C 2019 Australia 1373 177 993 3231 315 1511
#> 4 A 2020 Australia 1525 108 892 2830 372 1524
#> 5 B 2020 Australia 1228 197 808 3127 283 1608
#> 6 C 2020 Australia 1602 175 716 2753 305 1700
#> 7 A 2019 Nepal 1178 191 762 2899 283 1608
#> 8 B 2019 Nepal 1298 141 943 2779 333 1427
#> 9 C 2019 Nepal 1601 142 665 2476 332 1705
#> 10 A 2020 Nepal 1937 171 829 2721 266 1967
#> 11 B 2020 Nepal 1013 135 991 3645 302 1805
#> 12 C 2020 Nepal 1708 131 976 2950 306 1820
#> 13 A 2019 Tanzania 1462 156 608 2781 286 1633
#> 14 B 2019 Tanzania 1117 106 910 3126 336 1331
#> 15 C 2019 Tanzania 1664 180 723 2579 262 1518
#> 16 A 2020 Tanzania 1194 192 924 3010 296 1423
#> 17 B 2020 Tanzania 1243 182 634 2961 306 1713
#> 18 C 2020 Tanzania 1767 114 789 2437 374 1558
#> # … with abbreviated variable names ¹other_employees, ²other_profit
Created on 2022-12-08 with reprex v2.0.2