Home > Back-end >  Summing columns conditional on matches with the value of the first column
Summing columns conditional on matches with the value of the first column

Time:05-27

I have a large dataframe that looks simplified like this:

df <- data.frame(Code = c("AUS1", "AUS2", "AUS3", "AUT1", "AUT2", "AUT3", "BEL1", "BEL2", "BEL3"),
                 AUS1 = c(3, 45, 1, 65, 817, 235, 1223, 234, 867),
                 AUS2 = c(354, 12, 843, 346, 9754, 123, 4638, 988, 4),
                 AUS3 = c(67, 82, 9485, 127, 347, 505, 123, 2, 3),
                 AUT1 = c(1182, 943, 3, 12, 345, 174, 12, 4, 12),
                 AUT2 = c(78, 8882, 17, 49, 2, 958, 76, 24, 198),
                 AUT3 = c(1, 99, 300, 17, 389, 234, 122, 62, 91),
                 BEL1 = c(88, 192, 943, 199, 238, 1294, 1, 4,35),
                 BEL2 = c(983, 112, 538, 1274, 22, 94, 100, 84, 7),
                 BEL3 = c(41, 8819, 237, 11, 347, 12, 871, 34, 1))

I know want to summarise each row and each column on two different conditions.

First, I need a sum that excludes the values in which the Code matches the same three laters as the column names. Example: The sum of the first column (AUS1) should exclude the values of the rows with values from the first column (Code) also start with "AUS". For the fourth column (AUT1), the sum should exclude row values which have values of the first column (Code) that start with "AUT" and so on.

The desired output for the colum sums then would be: AUS1 = 3441, AUS2 = 15853, AUS3 = 1107, AUT1 = 2156, AUT2 = 9275, AUT = 675, BEL1 = 2954, BEL2 = 3023, BEL3 = 9467

After that, I would have to sum the rows on the same condition.

Second, I have to again sum each column and each row, but this time it should exclude only the value of the direct match. Example: For the sum of the second column (AUS1), it should only exclude the first row where AUS1 == AUS1. For the third column (AUS2), it should only exclude the value of the second row where AUS2 == AUS2.

Since my dataframes are quite large i cannot do this manually, but rather a function would be helpful.

CodePudding user response:

In an sapply you may loop over the names. For the first two jobs grepl the 1-3 substrings, for the second two jobs the name itself, and take the sum while excluding the Code column.

sapply(names(df)[-1], \(x) sum(df[!grepl(substr(x, 1, 3), df$Code), x]))
# AUS1  AUS2  AUS3  AUT1  AUT2  AUT3  BEL1  BEL2  BEL3 
# 3441 15853  1107  2156  9275   675  2954  3023  9467

sapply(names(df)[-1], \(x) sum(df[df$Code == x, !grepl(substr(x, 1, 3), names(df))][-1]))
# AUS1  AUS2  AUS3  AUT1  AUT2  AUT3  BEL1  BEL2  BEL3 
# 2373 19047  2038  2022 11525  2263  6194  1314  1175 

sapply(names(df)[-1], \(x) sum(df[!grepl(x, df$Code), x]))
# AUS1  AUS2  AUS3  AUT1  AUT2  AUT3  BEL1  BEL2  BEL3 
# 3487 17050  1256  2675 10282  1081  2993  3130 10372 

sapply(names(df)[-1], \(x) sum(df[df$Code == x, !grepl(x, names(df))][-1]))
# AUS1  AUS2  AUS3  AUT1  AUT2  AUT3  BEL1  BEL2  BEL3 
# 2794 19174  2882  2088 12259  3395  7165  1352  1217

CodePudding user response:

Task 1:

  • convert the table into long format
  • create two dummy variable that take the first 3 character of the column name (variable) and row name (Code)
  • exclude all values when the two dummy variable matches
  • group by the column name (variable) and summarise to give a sum for each
library(tidyverse)
df2 <- df %>% gather(variable, value, -Code)
df2 <- df2 %>% mutate(code_ = substr(Code, 1, 3),
                      var_ = substr(variable, 1, 3))
df2 <- df2 %>% filter(!code_ == var_)
df2 %>% group_by(variable) %>% summarise(sum(value))
# A tibble: 9 x 2
  variable `sum(value)`
  <chr>           <dbl>
1 AUS1             3441
2 AUS2            15853
3 AUS3             1107
4 AUT1             2156
5 AUT2             9275
6 AUT3              675
7 BEL1             2954
8 BEL2             3023
9 BEL3             9467

Task 2:

  • Essentially the same as task 1, only without the need to rely on dummy variable.
df3 <- df %>% gather(variable, value, -Code)
df3 <- df3 %>% filter(!Code == variable)
df3 %>% group_by(variable) %>% summarise(sum(value))
# A tibble: 9 x 2
  variable `sum(value)`
  <chr>           <dbl>
1 AUS1             3487
2 AUS2            17050
3 AUS3             1256
4 AUT1             2675
5 AUT2            10282
6 AUT3             1081
7 BEL1             2993
8 BEL2             3130
9 BEL3            10372

CodePudding user response:

Condition1

library(dplyr)
library(janitor)

df %>% 
  group_by(x = substr(Code, 1,3)) %>% 
  mutate(across(-Code, ~case_when(Code %in% cur_column() ~ NA_real_, 
                                   TRUE ~.))) %>% 
  mutate(across(-Code, ~case_when(any(is.na(.)) ~ NA_real_,
                                  TRUE ~ .))) %>% 
  janitor::adorn_totals() %>% 
  ungroup() %>% 
  filter(row_number()==n()) %>% 
  select(-x) %>% 
  t() %>% 
  data.frame()
        .
Code Total
AUS1  3441
AUS2 15853
AUS3  1107
AUT1  2156
AUT2  9275
AUT3   675
BEL1  2954
BEL2  3023

Condition2

library(dplyr)
library(janitor)

df %>% 
  mutate(across(-Code, ~case_when(Code %in% cur_column() ~ NA_real_, 
                                  TRUE ~.))) %>% 
  janitor::adorn_totals() %>% 
  filter(row_number()==n()) %>% 
  t() %>% 
  data.frame()
         .
Code Total
AUS1  3487
AUS2 17050
AUS3  1256
AUT1  2675
AUT2 10282
AUT3  1081
BEL1  2993
BEL2  3130
BEL3 10372
  •  Tags:  
  • r
  • Related