Count only the first occurrence of a sequence with dplyr

Time:10-01

I am struggling with this task: I have this dataframe:

df <- structure(list(col1 = c("A", "A", "A", "B", "A", "A", "C", "A"
)), class = "data.frame", row.names = c(NA, -8L))

  col1
1    A
2    A
3    A
4    B
5    A
6    A
7    C
8    A

I want to count the A values in the first consecutive run of A only (rows 1 to 3).

The expected answer is 3!

Update: my attempt, which does not produce the expected output:

df %>% 
  summarise(first_sequence_A = sum(col1=="A"))
# not working because this counts all A's

# resulting in:
  first_sequence_A
1                6

expected:
  first_sequence_A
1                3

I prefer a solution with dplyr

I have tried cumsum, rle, lag... but I can't get it!

CodePudding user response:

Not sure what your ideal final output would look like, but maybe something like this?

Edit: there is probably a better and more succinct way to do this, but...

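library(dplyr)
library(data.table)  # only needed for rleid()

df %>% 
  mutate(x = rleid(col1)) %>%   # run id: increments whenever col1 changes
  group_by(col1, x) %>% 
  tally() %>%                   # n = length of each run
  slice(1) %>%                  # first run for each value of col1
  filter(col1 == "A") %>% 
  summarize(first_sequence_A = n)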

Gives us:

# A tibble: 1 x 2
  col1  first_sequence_A
  <chr>            <int>
1 A                    3
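
If you would rather not load data.table, something similar can probably be done with dplyr alone. This is only a minimal sketch, assuming dplyr 1.1.0 or later (where consecutive_id() is available); the names run and result are just illustrative and not from the question.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(col1 = c("A", "A", "A", "B", "A", "A", "C", "A"))

result <- df %>%
  mutate(run = consecutive_id(col1)) %>%  # run id: increments whenever col1 changes
  filter(col1 == "A") %>%                 # keep only rows belonging to runs of A
  filter(run == min(run)) %>%             # keep the earliest of those runs
  summarise(first_sequence_A = n())       # count its rows

result
```

For the data in the question this should give first_sequence_A = 3, without the extra col1 column.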

CodePudding user response:

We can use rle from base R, which returns the length of each run of identical values:

with(rle(df$col1 == "A"), lengths[values][1])
[1] 3

Or in dplyr syntax

df %>% 
   summarise(first_sequence_A = with(rle(col1 == "A"), lengths[values][1]))
  first_sequence_A
1                3
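
To see why this works, it may help to look at the intermediate pieces. The sketch below just unpacks the same rle() call step by step; the names runs and a_runs are illustrative.

```r
# Same data as in the question
df <- data.frame(col1 = c("A", "A", "A", "B", "A", "A", "C", "A"))

# Run-length encode the logical vector "is this row an A?"
runs <- rle(df$col1 == "A")
runs$values   # TRUE FALSE TRUE FALSE TRUE
runs$lengths  # 3 1 2 1 1

# Keep only the lengths of the TRUE runs (the runs of A), in order of appearance
a_runs <- runs$lengths[runs$values]  # 3 2 1

# The first element is the length of the first run of A
first_sequence_A <- a_runs[1]
first_sequence_A
```

Note that if col1 contains no A at all, a_runs is empty and a_runs[1] is NA rather than 0, so you may want to handle that case separately.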