I have a large dataframe that looks simplified like this:
df <- data.frame(Code = c("AUS1", "AUS2", "AUS3", "AUT1", "AUT2", "AUT3", "BEL1", "BEL2", "BEL3"),
AUS1 = c(3, 45, 1, 65, 817, 235, 1223, 234, 867),
AUS2 = c(354, 12, 843, 346, 9754, 123, 4638, 988, 4),
AUS3 = c(67, 82, 9485, 127, 347, 505, 123, 2, 3),
AUT1 = c(1182, 943, 3, 12, 345, 174, 12, 4, 12),
AUT2 = c(78, 8882, 17, 49, 2, 958, 76, 24, 198),
AUT3 = c(1, 99, 300, 17, 389, 234, 122, 62, 91),
BEL1 = c(88, 192, 943, 199, 238, 1294, 1, 4,35),
BEL2 = c(983, 112, 538, 1274, 22, 94, 100, 84, 7),
BEL3 = c(41, 8819, 237, 11, 347, 12, 871, 34, 1))
I know want to summarise each row and each column on two different conditions.
First, I need a sum that excludes the values in which the Code matches the same three laters as the column names. Example: The sum of the first column (AUS1) should exclude the values of the rows with values from the first column (Code) also start with "AUS". For the fourth column (AUT1), the sum should exclude row values which have values of the first column (Code) that start with "AUT" and so on.
The desired output for the colum sums then would be: AUS1 = 3441, AUS2 = 15853, AUS3 = 1107, AUT1 = 2156, AUT2 = 9275, AUT = 675, BEL1 = 2954, BEL2 = 3023, BEL3 = 9467
After that, I would have to sum the rows on the same condition.
Second, I have to again sum each column and each row, but this time it should exclude only the value of the direct match. Example: For the sum of the second column (AUS1), it should only exclude the first row where AUS1 == AUS1. For the third column (AUS2), it should only exclude the value of the second row where AUS2 == AUS2.
Since my dataframes are quite large i cannot do this manually, but rather a function would be helpful.
CodePudding user response:
In an sapply
you may loop over the names
. For the first two jobs grepl
the 1-3 substr
ings, for the second two jobs the name itself, and take the sum
while excluding the Code
column.
sapply(names(df)[-1], \(x) sum(df[!grepl(substr(x, 1, 3), df$Code), x]))
# AUS1 AUS2 AUS3 AUT1 AUT2 AUT3 BEL1 BEL2 BEL3
# 3441 15853 1107 2156 9275 675 2954 3023 9467
sapply(names(df)[-1], \(x) sum(df[df$Code == x, !grepl(substr(x, 1, 3), names(df))][-1]))
# AUS1 AUS2 AUS3 AUT1 AUT2 AUT3 BEL1 BEL2 BEL3
# 2373 19047 2038 2022 11525 2263 6194 1314 1175
sapply(names(df)[-1], \(x) sum(df[!grepl(x, df$Code), x]))
# AUS1 AUS2 AUS3 AUT1 AUT2 AUT3 BEL1 BEL2 BEL3
# 3487 17050 1256 2675 10282 1081 2993 3130 10372
sapply(names(df)[-1], \(x) sum(df[df$Code == x, !grepl(x, names(df))][-1]))
# AUS1 AUS2 AUS3 AUT1 AUT2 AUT3 BEL1 BEL2 BEL3
# 2794 19174 2882 2088 12259 3395 7165 1352 1217
CodePudding user response:
Task 1:
- convert the table into long format
- create two dummy variable that take the first 3 character of the column name (variable) and row name (Code)
- exclude all values when the two dummy variable matches
- group by the column name (variable) and summarise to give a sum for each
library(tidyverse)
df2 <- df %>% gather(variable, value, -Code)
df2 <- df2 %>% mutate(code_ = substr(Code, 1, 3),
var_ = substr(variable, 1, 3))
df2 <- df2 %>% filter(!code_ == var_)
df2 %>% group_by(variable) %>% summarise(sum(value))
# A tibble: 9 x 2
variable `sum(value)`
<chr> <dbl>
1 AUS1 3441
2 AUS2 15853
3 AUS3 1107
4 AUT1 2156
5 AUT2 9275
6 AUT3 675
7 BEL1 2954
8 BEL2 3023
9 BEL3 9467
Task 2:
- Essentially the same as task 1, only without the need to rely on dummy variable.
df3 <- df %>% gather(variable, value, -Code)
df3 <- df3 %>% filter(!Code == variable)
df3 %>% group_by(variable) %>% summarise(sum(value))
# A tibble: 9 x 2
variable `sum(value)`
<chr> <dbl>
1 AUS1 3487
2 AUS2 17050
3 AUS3 1256
4 AUT1 2675
5 AUT2 10282
6 AUT3 1081
7 BEL1 2993
8 BEL2 3130
9 BEL3 10372
CodePudding user response:
Condition1
library(dplyr)
library(janitor)
df %>%
group_by(x = substr(Code, 1,3)) %>%
mutate(across(-Code, ~case_when(Code %in% cur_column() ~ NA_real_,
TRUE ~.))) %>%
mutate(across(-Code, ~case_when(any(is.na(.)) ~ NA_real_,
TRUE ~ .))) %>%
janitor::adorn_totals() %>%
ungroup() %>%
filter(row_number()==n()) %>%
select(-x) %>%
t() %>%
data.frame()
.
Code Total
AUS1 3441
AUS2 15853
AUS3 1107
AUT1 2156
AUT2 9275
AUT3 675
BEL1 2954
BEL2 3023
Condition2
library(dplyr)
library(janitor)
df %>%
mutate(across(-Code, ~case_when(Code %in% cur_column() ~ NA_real_,
TRUE ~.))) %>%
janitor::adorn_totals() %>%
filter(row_number()==n()) %>%
t() %>%
data.frame()
.
Code Total
AUS1 3487
AUS2 17050
AUS3 1256
AUT1 2675
AUT2 10282
AUT3 1081
BEL1 2993
BEL2 3130
BEL3 10372