I have a large dataframe of true (1) and false (0) information. Below is a reduced sample of it.
df <- read.table(text = " 0m-10m 0m-15m 0m-20m 0m-25m
X 1 0 1 1
Y 1 1 0 0
Z 0 1 0 1 ", header = T)
I need to subset rows where two consecutive contrasts are true (both == 1) so that the result looks like this
0m-10m 0m-15m 0m-20m 0m-25m
X 1 0 1 1
Y 1 1 0 0
I can filter by rowSums >=2, but this is also true for row Z, which I need to exclude. Thoughts?
CodePudding user response:
Just remove the first and last columns, create a logical matrix with &
and then use rowSums
to create logical vector to subset
df[rowSums(df[-1] & df[-ncol(df)]) > 0,]
-output
0m-10m 0m-15m 0m-20m 0m-25m
X 1 0 1 1
Y 1 1 0 0
The rowSums
above is not based on a single data. We are creating a logical matrix from two equal sized datasets by removing the first and last columns and then using &
so, that if both have 1 in the same location, only it will return TRUE
and FALSE otherwise. rowSums
on this matrix returns the sum of TRUE (or TRUE/FALSE -> 1/0)
> df[-1] & df[-ncol(df)]
0m-15m 0m-20m 0m-25m
X FALSE FALSE TRUE
Y TRUE FALSE FALSE
Z FALSE FALSE FALSE
> rowSums(df[-1] & df[-ncol(df)])
X Y Z
1 1 0
> rowSums(df[-1] & df[-ncol(df)]) > 0
X Y Z
TRUE TRUE FALSE
Or if we are looking for a general case, we may use rle
- run-length-encoding on each row by looping over the rows with apply
and MARGIN = 1
. The rle
returns a list
of values
and lengths
for each adjacent similar values. Then, we create a logical vector based on the lengths
and values
i.e. if the 'values' is 1 and 'lengths' is 2.
n <- 2
df[apply(df, 1, FUN = function(x) with(rle(x), any(lengths == n & values))),]
0m-10m 0m-15m 0m-20m 0m-25m
X 1 0 1 1
Y 1 1 0 0
-code breakup
> apply(df, 1, FUN = rle)
$X
Run Length Encoding
lengths: Named int [1:3] 1 1 2
- attr(*, "names")= chr [1:3] "0m-15m" "0m-20m" ""
values : Named int [1:3] 1 0 1
- attr(*, "names")= chr [1:3] "0m-10m" "0m-15m" "0m-25m"
$Y
Run Length Encoding
lengths: Named int [1:2] 2 2
- attr(*, "names")= chr [1:2] "0m-20m" ""
values : Named int [1:2] 1 0
- attr(*, "names")= chr [1:2] "0m-15m" "0m-25m"
$Z
Run Length Encoding
lengths: Named int [1:4] 1 1 1 1
- attr(*, "names")= chr [1:4] "0m-15m" "0m-20m" "0m-25m" ""
values : Named int [1:4] 0 1 0 1
- attr(*, "names")= chr [1:4] "0m-10m" "0m-15m" "0m-20m" "0m-25m"
> apply(df, 1, FUN = function(x) with(rle(x),lengths == n & values))
$X
0m-15m 0m-20m
FALSE FALSE TRUE
$Y
0m-20m
TRUE FALSE
$Z
0m-15m 0m-20m 0m-25m
FALSE FALSE FALSE FALSE
> apply(df, 1, FUN = function(x) with(rle(x), any(lengths == n & values)))
X Y Z
TRUE TRUE FALSE
CodePudding user response:
Here is an alternative way using pivot:
library(dplyr)
library(tidyr)
df %>%
rownames_to_column("xyz") %>%
pivot_longer(
-xyz
) %>%
group_by(xyz) %>%
mutate(helper = lag(value),
flag = ifelse(value==1 & helper==1, 1,0)) %>%
filter(any(flag==1)) %>%
pivot_wider(
names_from = name,
values_from = value,
values_fill = 0
) %>%
summarize(across(starts_with("X"), sum)) %>%
column_to_rownames("xyz")
X0m.10m X0m.15m X0m.20m X0m.25m
X 1 0 1 1
Y 1 1 0 0
CodePudding user response:
A solution based on the creation of an auxiliary column concatenating all original columns as a string (using tidyr::unite
) and then using stringr::str_detect
on the strings:
library(tidyverse)
df <- read.table(text = " 0m-10m 0m-15m 0m-20m 0m-25m
X 1 0 1 1
Y 1 1 0 0
Z 0 1 0 1 ", header = T)
df %>%
unite(aux, sep = "", remove = F) %>%
filter(str_detect(aux, "11")) %>%
select(-aux)
#> X0m.10m X0m.15m X0m.20m X0m.25m
#> X 1 0 1 1
#> Y 1 1 0 0