I have a data frame where there a several columns with prefix col
:
df <- data.frame(row_id = c(100, 101, 102, 103, 104, 105, 106, 107, 108, 109),
col1 = c(1,0,1,1,1,0,0,1,1,0),
col2 = c(1,1,1,0,0,1,1,1,0,0),
col3 = c(0,0,1,0,0,1,1,1,1,0),
col4 = c(1,1,1,0,0,1,0,1,1,1),
col5 = c(1,1,0,0,1,0,1,0,0,1))
I want to be able to create a separate table where the first column is row_id, and for every row, I want to get the names of the first and last column for every run of 1. For example, the output for row_id = 100
would be:
| row_id | first_col | last_col |
| 100 | col1 | col2 |
| 100 | col4 | col5 |
I'll need to apply this method to a much larger data frame, about a thousand columns per row, so any help on that would be much appreaciated.
Thank you!
CodePudding user response:
Remove if (.N > 1)
to include length-1 "run"s
df <- data.frame(row_id = c(100, 101, 102, 103, 104, 105, 106, 107, 108, 109),
col1 = c(1,0,1,1,1,0,0,1,1,0),
col2 = c(1,1,1,0,0,1,1,1,0,0),
col3 = c(0,0,1,0,0,1,1,1,1,0),
col4 = c(1,1,1,0,0,1,0,1,1,1),
col5 = c(1,1,0,0,1,0,1,0,0,1))
library(data.table)
library(magrittr)
setDT(df) %>%
melt('row_id') %>%
.[, rid := rleid(value), keyby = row_id] %>%
.[value == 1,
if (.N > 1)
.(first_col = first(variable),
last_col = last(variable)),
by = .(row_id, rid)] %>%
.[, -'rid']
#> row_id first_col last_col
#> 1: 100 col1 col2
#> 2: 100 col4 col5
#> 3: 101 col4 col5
#> 4: 102 col1 col4
#> 5: 105 col2 col4
#> 6: 106 col2 col3
#> 7: 107 col1 col4
#> 8: 108 col3 col4
#> 9: 109 col4 col5
Created on 2021-11-24 by the reprex package (v2.0.1)