Finding different max between specified rows of different fields

Time:08-02

I want to find the maximum of each column, taken over a row range that is specified separately for each column.

My actual data frame has 50K columns and 1K rows, so I can't use a loop without greatly increasing the run time.

Data Frame:

row V1 V2 V3 V4
  1  5  2  4  5
  2  3  5  1  6
  3  7  3  2  6
  4  2  5  3 10
  5  6  9  1  2

beg_row <- c(2, 1, 2, 3)
end_row <- c(4, 3, 3, 5)

Desired output:

c(7, 5, 2, 10)

CodePudding user response:

You can try mapply, although I suspect it won't reduce the run time much when there are this many columns.

> mapply(function(x, y, z) max(x[y:z]), df[-1], beg_row, end_row)
V1 V2 V3 V4
 7  5  2 10
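The suspicion about run time can be checked directly. The following is a rough sketch, assuming synthetic data scaled down from the question (the names big, beg, end, n_row and n_col are made up for illustration). Since mapply still calls the function once per column, it would not be surprising if it timed about the same as an explicit loop, but the numbers depend on the machine.

```r
set.seed(1)
n_row <- 1000
n_col <- 2000   # scaled down from the 50K columns in the question

# Synthetic integer data frame with n_col columns
big <- as.data.frame(
  matrix(sample.int(100L, n_row * n_col, replace = TRUE), nrow = n_row)
)

# Random per-column row ranges; end is always > beg and <= n_row
beg <- sample.int(n_row %/% 2, n_col, replace = TRUE)
end <- beg + sample.int(n_row %/% 2, n_col, replace = TRUE)

# Time the mapply approach
t_mapply <- system.time(
  res <- mapply(function(x, b, e) max(x[b:e]), big, beg, end)
)

# Time a plain for loop doing the same work
t_loop <- system.time({
  res_loop <- integer(n_col)
  for (j in seq_len(n_col)) res_loop[j] <- max(big[[j]][beg[j]:end[j]])
})

print(rbind(mapply = t_mapply, loop = t_loop))
```

Both versions return the same maxima, so the comparison is only about speed.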

Data

df <- structure(list(
  row = 1:5,
  V1 = c(5L, 3L, 7L, 2L, 6L),
  V2 = c(2L, 5L, 3L, 5L, 9L),
  V3 = c(4L, 1L, 2L, 3L, 1L),
  V4 = c(5L, 6L, 6L, 10L, 2L)
), class = "data.frame", row.names = c(NA, -5L))

beg_row <- c(2, 1, 2, 3)

end_row <- c(4, 3, 3, 5)

CodePudding user response:

An option with dplyr

library(dplyr)
df %>%
  summarise(across(-row, ~ {
    # position of the current column among the value columns
    i1 <- match(cur_column(), names(df)[-1])
    max(.x[beg_row[i1]:end_row[i1]])
  }))
  V1 V2 V3 V4
1  7  5  2 10

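Another option is to set the values outside each column's range to NA and then use colMaxs

library(matrixStats)
m <- as.matrix(df[-1])
# TRUE for cells inside each column's beg_row..end_row range
in_range <- row(m) >= beg_row[col(m)] & row(m) <= end_row[col(m)]
# NA^TRUE is NA and NA^FALSE is 1, so out-of-range cells become NA
colMaxs((NA^!in_range) * m, na.rm = TRUE)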
[1]  7  5  2 10
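The masking idea can also be followed step by step without extra packages. This is a minimal sketch using the example data from the question, with base R's apply standing in for colMaxs; the variable names m, inside and res are only for illustration.

```r
# Example data copied from the question
df <- data.frame(
  row = 1:5,
  V1 = c(5L, 3L, 7L, 2L, 6L),
  V2 = c(2L, 5L, 3L, 5L, 9L),
  V3 = c(4L, 1L, 2L, 3L, 1L),
  V4 = c(5L, 6L, 6L, 10L, 2L)
)
beg_row <- c(2, 1, 2, 3)
end_row <- c(4, 3, 3, 5)

m <- as.matrix(df[-1])   # value columns only, as a matrix

# TRUE where a cell lies inside its own column's [beg_row, end_row] range
inside <- row(m) >= beg_row[col(m)] & row(m) <= end_row[col(m)]

m[!inside] <- NA         # blank out everything outside the range

# Column-wise maximum ignoring the masked cells
res <- apply(m, 2, max, na.rm = TRUE)
res
# V1 = 7, V2 = 5, V3 = 2, V4 = 10
```

Note that the mask has the same size as the data, so on a very wide data frame this trades memory for avoiding an explicit loop over columns.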

CodePudding user response:

Another possible solution, based on purrr::pmap_dbl:

library(purrr)

pmap_dbl(list(beg_row, end_row, 2:ncol(df)), ~ max(df[..1:..2, ..3]))

#> [1]  7  5  2 10

CodePudding user response:

Using the data from @ThomasIsCoding's answer. This is the same basic approach as the others, but it puts beg_row and end_row in a structure that roughly matches the data they correspond to.

library(purrr)

rbind(beg_row, end_row) %>% 
  as.data.frame %>% 
  map2_dbl(df[-1], ~ max(.y[exec(seq, !!!.x)]))
#> V1 V2 V3 V4 
#>  7  5  2 10

If your start and end rows are very far apart, it may be faster to check whether df$row lies between them with data.table::between than to create a new index vector for each subset. I'm not sure, though.

library(purrr)
library(data.table)

rbind(beg_row, end_row) %>% 
  as.data.frame %>% 
  map2_dbl(df[-1], ~ max(.y[exec(between, df$row, !!!.x)]))
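To make the difference between the two subsetting styles concrete, here is a small base R sketch for a single column. It assumes, as in the example data, that the row column holds the row numbers 1 to n, and it uses `>=` and `<=` as a stand-in for between with its default inclusive bounds; x, row_id, beg and end are illustrative names.

```r
x <- c(5L, 3L, 7L, 2L, 6L)   # column V1 from the example
row_id <- 1:5                # df$row in the example
beg <- 2
end <- 4

by_index <- x[beg:end]                         # builds the index vector 2:4
by_mask  <- x[row_id >= beg & row_id <= end]   # logical mask, like between()

identical(by_index, by_mask)   # both pick rows 2, 3 and 4
max(by_mask)                   # 7
```

The mask is always as long as the column, while the index vector is only as long as the range, so which one is cheaper likely depends on how wide the ranges are.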