Check the validity of a statement in specific rows of a data frame in R

Time:07-30

I have a data frame (n x m) in R, namely A, like this:

A <- as.data.frame(matrix(rnorm(50 * 10, mean = 0, sd = 1), 50, 10))
colnames(A) <-
  c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")

For every fifth row in the data (A[1, ], A[6, ], A[11, ], A[16, ], ..., A[46, ]), I want to see whether column X1 contains the greatest value across all of the columns starting with "X" for that row. In the end, I would like a single number giving the percentage of these rows for which X1 holds the largest value across the columns.

CodePudding user response:

Something like:

sA <- A[seq.int(1, nrow(A), by = 5), ]  ## keep rows 1, 6, 11, ...
logi <- rowSums(sA$X1 > sA[-1]) == (ncol(A) - 1)  ## X1 > other columns
mean(logi)  ## the percentage of the above event

In the case where negative values appear, it does not seem to work. I produced a row with X1 = -3.12, X2 = -3.44, X3 = -3.71, X4 = -4.34, .... I know X1 is the highest, but it appears as FALSE in logi. Why is that?

The row you presented in your screenshot is the 2nd row. You want to skip the 2nd row, don't you? logi is computed on rows 1, 6, 11, ..., so it has 10 logical values, not 50. The 2nd FALSE is therefore not for row 2, but for row 6.
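To make the mapping between logi and the original rows explicit, the result can be tabulated next to the row numbers it was computed on. The sketch below uses a small hand-made data frame, `A_demo`, whose values are invented for illustration (they are not the OP's data); row 2 has X1 as its maximum but is skipped, while row 6 is sampled and has an all-negative row where X1 is largest.

```r
# Hypothetical 11-row data frame with negative values (made-up numbers)
A_demo <- data.frame(
  X1 = c(-3.12, -3.12, -5, -5, -5, -1, -5, -5, -5, -5, -5),
  X2 = c(-3.00, -3.44, rep(-4, 9)),
  X3 = c(-3.71, -3.71, rep(-4, 9))
)

idx  <- seq.int(1, nrow(A_demo), by = 5)   # sampled rows: 1, 6, 11
sA   <- A_demo[idx, ]
logi <- rowSums(sA$X1 > sA[-1]) == (ncol(A_demo) - 1)

# One line per sampled row: original row number and whether X1 is the maximum
res <- data.frame(row = idx, x1_is_max = as.vector(logi))
print(res)
```

Here the second entry of `res` belongs to row 6, not row 2, and it is TRUE even though every value in that row is negative.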

(I added links to the screenshots, because the OP removed them in a later edit.)

Now it seems to work perfectly! Thank you for your help!

CodePudding user response:

Here is how we could do it:

  1. Create an id column with the row number (important later to get the desired rows).

  2. Keep rows 1, 6, 11, 16, etc. as desired. With the modulo operator %% we keep the rows whose row number leaves a remainder of 1 when divided by 5.

  3. mutate(across(-c(X1, id), ~. < X1, 1, 0)*1) %>% looks complicated but is not: we mutate across all columns except X1 and id, and within an anonymous function (~) we check whether each column (.), i.e. X2:X10, is smaller than X1. The *1 turns FALSE and TRUE into 0 and 1.

  4. Check whether the row sum over X2:X10 equals 9 (there are 9 columns, so the sum is 9 only if all are 1, i.e. TRUE); if yes 1, else 0.

  5. Save to data frame B.

  6. Next, take df A, add an id column, and join it with df B.

  7. Calculate AR as the sum of the check column divided by the number of rows in df B, times 100.

  8. Use slice(1) and pull() to return a single AR value.
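As a quick illustration of the modulo step, the snippet below lists which row numbers of a 50-row data frame satisfy `%% 5 == 1`. The names `row_ids` and `kept` are introduced here only for the example.

```r
row_ids <- seq_len(50)                 # row numbers of a 50-row data frame
kept <- row_ids[row_ids %% 5 == 1]     # remainder 1 after dividing by 5
print(kept)
#> [1]  1  6 11 16 21 26 31 36 41 46
```

A condition of `%% 5 == 0` would instead keep rows 5, 10, 15, ..., which is why the remainder has to be 1 here.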

library(dplyr)

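B <- A %>% 
  mutate(id = row_number()) %>%                   # row id, used later for the join
  filter(row_number() %% 5 == 1) %>%              # keep rows 1, 6, 11, ...
  mutate(across(-c(X1, id), ~. < X1, 1, 0)*1) %>% # 1 where the column is below X1, else 0
  mutate(x = rowSums(.[2:10])) %>%                # number of columns X2:X10 below X1
  mutate(check = ifelse(x == 9, 1, 0)) %>%        # 1 if X1 beats all 9 other columns
  select(id, check)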

A %>% 
  mutate(id = row_number()) %>% 
  left_join(B, by="id") %>% 
  select(-id) %>% 
  mutate(AR = paste0(sum(check, na.rm = TRUE)/nrow(B)*100, "%")) %>% 
  slice(1) %>% 
  pull(AR)
         

CodePudding user response:

We can extract every fifth row with filter(), reshape it into long format with pivot_longer(), pull out the variable holding the maximum in each row with group_by() and slice_max(), and then create an indicator for whether that variable is X1.

Example data

You'll need to call set.seed() to create reproducible data.

library(dplyr)
library(tidyr)
set.seed(123)

A <- as.data.frame(matrix(rnorm(50 * 10, mean = 0, sd = 1), 50, 10))
colnames(A) <-
  c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")

Get desired percentages

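# Add row id and keep a subset of rows.
# Note: this condition keeps rows 1, 6, 12, 18, ..., 48 (see the output below),
# not rows 1, 6, 11, ..., 46; `row_number() %% 5 == 1` would select the latter.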
A5 <- A %>%
  mutate(row_id = 1:nrow(A)) %>%
  filter(row_number() %% 6 == 0 |
           row_number() == 1)

# Pivot so each variable is its own row
A5_long <- A5 %>%
  pivot_longer(cols = starts_with("X"),
               names_to = "variable",
               values_to = "values")

# Get maximum value for each row then
# check if the variable is X1 and add indicator
results <- A5_long %>%
  group_by(row_id) %>%
  slice_max(values) %>%
  mutate(max_is_X1 = ifelse(variable == "X1", 1, 0))

print(results)
#> # A tibble: 9 x 4
#> # Groups:   row_id [9]
#>   row_id variable values max_is_X1
#>    <int> <chr>     <dbl>     <dbl>
#> 1      1 X5        2.20          0
#> 2      6 X1        1.72          1
#> 3     12 X7        0.757         0
#> 4     18 X10       1.72          0
#> 5     24 X4        2.13          0
#> 6     30 X1        1.25          1
#> 7     36 X8        1.38          0
#> 8     42 X10       1.40          0
#> 9     48 X2        1.53          0

# Get percentage
AR <- paste0(round(mean(results$max_is_X1) * 100, 0), "%")

print(AR)
#> [1] "22%"
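
As a cross-check on rows 1, 6, 11, ..., 46 as specified in the question, the same percentage can be sketched in base R with `max.col()`, which returns the index of the column holding each row's maximum. This is only an illustrative alternative, not part of the answer above; `x_cols`, `top_col`, and `pct_x1` are names introduced for the example, and no output is shown because it depends on the simulated data.

```r
set.seed(123)
A <- as.data.frame(matrix(rnorm(50 * 10, mean = 0, sd = 1), 50, 10))
colnames(A) <- paste0("X", 1:10)

sA <- A[seq.int(1, nrow(A), by = 5), ]      # rows 1, 6, 11, ..., 46
x_cols <- startsWith(names(sA), "X")        # only columns starting with "X"

# Index of the column holding the maximum in each sampled row
top_col <- max.col(as.matrix(sA[, x_cols]), ties.method = "first")

# Percentage of sampled rows in which that column is X1
pct_x1 <- mean(names(sA)[x_cols][top_col] == "X1") * 100
print(pct_x1)
```

With continuous random values, ties are not a practical concern, so `ties.method = "first"` only serves to make the result deterministic.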