I have an n x m data frame in R, namely A, like this:
A <- as.data.frame(matrix(rnorm(50 * 10, mean = 0, sd = 1), 50, 10))
colnames(A) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")
For every fifth row in the data (A[1, ], A[6, ], A[11, ], A[16, ], ..., A[46, ]), I want to see whether the column X1 contains the greatest value across all of the columns starting with "X" for that row. In the end, I would like to have a single number reflecting the percentage of these rows for which column X1 contains the largest value across those columns.
CodePudding user response:
Something like:
sA <- A[seq.int(1, nrow(A), by = 5), ] ## keep rows 1, 6, 11, ...
logi <- rowSums(sA$X1 > sA[-1]) == (ncol(A) - 1) ## TRUE when X1 > every other column
mean(logi) ## proportion of those rows (multiply by 100 for a percentage)
In the case where negative values appear, it does not seem to work. I produced a row with X1 = -3.12, X2 = -3.44, X3 = -3.71, X4 = -4.34, .... I know X1 is the highest, but it appears as FALSE in logi. Why is that?
The row you presented in your screenshot is the 2nd row of the data. You want to skip the 2nd row, don't you? logi is computed only on rows 1, 6, 11, ..., so in the end logi has 10 logical values, not 50. The 2nd FALSE is therefore not for row 2 but for row 6.
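If you want to double-check which original row each value of logi belongs to, you can label the values with the row names kept in sA; this is just a small sketch building on the code above (the row names of sA are the original row numbers):
sA <- A[seq.int(1, nrow(A), by = 5), ] ## rows 1, 6, 11, ..., 46
logi <- rowSums(sA$X1 > sA[-1]) == (ncol(A) - 1)
names(logi) <- rownames(sA) ## label each value with its original row number
logi ## the 2nd value is labelled "6", not "2"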
Now it seems to work perfectly! Thank you for your help!
CodePudding user response:
Here is how we could do it:
- Create an id column with row_number() (important later for joining back to the desired rows).
- Filter to keep rows 1, 6, 11, 16, ...: with the modulo operator %% we simply check whether the row number modulo 5 equals 1.
- mutate(across(-c(X1, id), ~ (. < X1) * 1)): this looks complicated but is not. We mutate across all columns except X1 and id, and within an anonymous function (~) we check whether each column (X2:X10) is smaller than X1; the * 1 turns FALSE/TRUE into 0/1.
- Then we check whether the row sums over X2:X10 equal 9 (there are 9 such columns, so if all are 1, i.e. TRUE, the sum is 9); if yes, check is 1, else 0.
- Save the result to data frame B.
- Next we take data frame A, add an id column and left-join it with B.
- Calculate AR as the sum of the check column divided by the number of rows in B, times 100.
- slice(1) to pull only one AR.
library(dplyr)
B <- A %>%
  mutate(id = row_number()) %>%
  filter(row_number() %% 5 == 1) %>% ## keep rows 1, 6, 11, ..., 46
  mutate(across(-c(X1, id), ~ (. < X1) * 1)) %>% ## 1 if the column is smaller than X1
  mutate(x = rowSums(.[2:10])) %>% ## how many of X2:X10 are smaller than X1
  mutate(check = ifelse(x == 9, 1, 0)) %>% ## 1 if X1 beats all 9 other columns
  select(id, check)
A %>%
  mutate(id = row_number()) %>%
  left_join(B, by = "id") %>%
  select(-id) %>%
  mutate(AR = paste0(sum(check, na.rm = TRUE) / nrow(B) * 100, "%")) %>% ## percentage of checked rows where X1 is the max
  slice(1) %>%
  pull(AR)
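For comparison, the same percentage can also be computed in one shorter dplyr pipe without the join; this is just a sketch of an alternative (pmax() takes the row-wise maximum over the X columns, and a tie with X1 counts as X1 being the maximum):
A %>%
  filter(row_number() %% 5 == 1) %>% ## keep rows 1, 6, 11, ..., 46
  mutate(X1_is_max = X1 == do.call(pmax, select(., starts_with("X")))) %>% ## row-wise max of the X columns
  summarise(AR = paste0(mean(X1_is_max) * 100, "%")) %>%
  pull(AR)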
CodePudding user response:
We can just extract every fifth row with filter(), turn the data into long format with pivot_longer(), pull out the variable with the maximum value per row using group_by() and slice_max(), and then create an indicator for whether that variable is X1.
Example data
You'll need set.seed() to create reproducible data.
library(dplyr)
library(tidyr)
set.seed(123)
A <- as.data.frame(matrix(rnorm(50 * 10, mean = 0, sd = 1), 50, 10))
colnames(A) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")
Get the desired percentage
# Add a row id and keep a subset of rows
# (row 1 plus every 6th row; see the note after the output for a
# filter that keeps exactly rows 1, 6, 11, ..., 46)
A5 <- A %>%
  mutate(row_id = 1:nrow(A)) %>%
  filter(row_number() %% 6 == 0 |
           row_number() == 1)
# Pivot so each variable is its own row
A5_long <- A5 %>%
  pivot_longer(cols = starts_with("X"),
               names_to = "variable",
               values_to = "values")
# Get the maximum value for each row, then
# check whether that variable is X1 and add an indicator
results <- A5_long %>%
  group_by(row_id) %>%
  slice_max(values) %>%
  mutate(max_is_X1 = ifelse(variable == "X1", 1, 0))
print(results)
#> # A tibble: 9 x 4
#> # Groups: row_id [9]
#> row_id variable values max_is_X1
#> <int> <chr> <dbl> <dbl>
#> 1 1 X5 2.20 0
#> 2 6 X1 1.72 1
#> 3 12 X7 0.757 0
#> 4 18 X10 1.72 0
#> 5 24 X4 2.13 0
#> 6 30 X1 1.25 1
#> 7 36 X8 1.38 0
#> 8 42 X10 1.40 0
#> 9 48 X2 1.53 0
# Get percentage
AR <- paste0(round(mean(results$max_is_X1) * 100, 0), "%")
print(AR)
#> [1] "22%"