Home > Enterprise >  Aggregate with respect to condition
Aggregate with respect to condition

Time:11-22

I face the following problem. I have a dataset that I simplify this way

df <- tibble(ID_1=c("a","a", "a","a", "a", "b", "b", "b", "b"), ID_2 = c("a","b", "c","d", "e", "b", "a", "c", "d"), 
             other= c(1, "bu", 34, 3, 5, 6, "jud", 82, 9), exchange=c(1, 13, 34, 3, 5, 6, 8, 82, 9), 
             goods=c("wood","wood", "wood","wood", "wood", "steel", "steel", "steel", "steel"))
df

We have exchanges between multiple countries that have different IDs. I want to sum all the exchanges between country A and country B. Between country A and country C. etc. the goods exchanged. That is, in the end, I'd like to have

ID_1 // ID_2 // Sum_exchangeAB
ID_1 // ID_3 // Sum_exchangeBC
....
ID_n // ID_n // Sumexchangenn

How can I do that in Dplyr? I tried group_by (ID_1, ID_2) then aggregate. Or summarise.

I tried for example the following: dataset2 %>% group_by(ID_1, ID_2, exchange) %>% summarise(dataset2, sum_of_exchange=sum(exchange)) . But this gives me the sum of all the exchange (that is, it returns one value). Not the sum of exchange between two countries.

Thank you for your help, much appreciated!!

CodePudding user response:

You were close:

library(dplyr)

df %>% group_by(ID_1, ID_2) %>% summarise(Sum_Exchange = sum(exchange))

This yields:

# A tibble: 9 x 3
# Groups:   ID_1 [2]
  ID_1  ID_2  Sum_Exchange
  <chr> <chr>       <dbl>
1 a     a               1
2 a     b              13
3 a     c              34
4 a     d               3
5 a     e               5
6 b     a               8
7 b     b               6
8 b     c              82
9 b     d               9

You can then remove identical IDs to omit AA, BB.

df %>% 
  group_by(ID_1, ID_2) %>% 
  summarise(Sum_Exchange = sum(exchange)) %>% 
  filter(ID_1 != ID_2)

add-on from comment

Combine dataframes with {dplyr}'s bind_rows(df1, df2).
Mind you that bind_rows() will generate new columns for variables/columns that do not exist in the original data frame and fill cells of existing columns with no value with NA.
You can force, certain values into the same column/variable by renaming such a variable.

For your problem at hand. Check what you want/need. The original data frame comes with other, goods variables. Thus, these are not existing in the summarised data frame. Doing a bind_rows() will populate these with NA. To get the Sum_Exchange into the exchange variable, rename that column.

CodePudding user response:

Is that what you are looking for?

df %>%
  rowwise() %>%
  mutate(new_id = paste0(str_sort(c(ID_1, ID_2)), collapse = '-')) %>%
  group_by(new_id) %>%
  summarize(exchange = sum(exchange))

which gives:

# A tibble: 8 x 2
  new_id exchange
  <chr>     <dbl>
1 a-a           1
2 a-b          21
3 a-c          34
4 a-d           3
5 a-e           5
6 b-b           6
7 b-c          82
8 b-d           9

UPDATE: it seems the TO wants to append the sum to the original data which can be done with:

df %>%
  rowwise() %>%
  mutate(new_id = paste0(str_sort(c(ID_1, ID_2)), collapse = '-')) %>%
  group_by(new_id) %>%
  mutate(sum_exchange = sum(exchange)) %>%
  ungroup()

which gives:

# A tibble: 9 x 7
  ID_1  ID_2  other exchange goods new_id sum_exchange
  <chr> <chr> <chr>    <dbl> <chr> <chr>         <dbl>
1 a     a     1            1 wood  a-a               1
2 a     b     bu          13 wood  a-b              21
3 a     c     34          34 wood  a-c              34
4 a     d     3            3 wood  a-d               3
5 a     e     5            5 wood  a-e               5
6 b     b     6            6 steel b-b               6
7 b     a     jud          8 steel a-b              21
8 b     c     82          82 steel b-c              82
9 b     d     9            9 steel b-d               9
  • Related