I am counting the events with multiple conditions and my final result is represented for each year seperately. The example of my df:
year <- c(rep(1981,20))
k1 <- c(rep(NA,5),rep("COLD",4),rep(NA,4),"COLD",NA,"COLD",rep(NA,4))
k2 <- c(rep(NA,10),rep("COLD",2),rep(NA,8))
k3 <- c(rep(NA,3),"COLD",rep(NA,16))
k4 <- c(rep(NA,3),rep("COLD",5),rep(NA,2),rep("COLD",6),NA,rep("COLD",3))
k5 <- c(rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,3),"COLD",rep(NA,8))
df <- data.frame(year,k1,k2,k3,k4,k5)
The code I use is below:
rle_col <- function(k_col) {
with(rle(is.na(k_col)), {
i1 <- values
i1[values & lengths <= 2] <- 'Invalid'
sum(!values & lengths >= 5 &
(lead(i1) != "Invalid" & lag(i1)>=1), na.rm = TRUE)
})
}
rezult <- df %>%
group_by(year) %>%
summarise(across(starts_with('k'), rle_col))
My code works well, but I need to get one more variable - the length of each event (or the number of members in each event). This means that instead of sum here sum(!values & lengths >= 5 & (lead(i1) != "Invalid" & lag(i1)>=1), na.rm = TRUE)
I need to get the number of values within each event, which satisfies these conditions. Is it possible? Thank you for any help.
The result I would like to see should be as follows:
Year k4
1981 5
1981 10
CodePudding user response:
There have to be easy solutions for this problem. Here is a ridiculous complicated (and not well performing) approach using tidyverse
:
library(tidyr)
library(dplyr)
library(purrr)
map_df(
names(df)[grepl("^k", names(df))],
~df %>%
group_by(year) %>%
mutate(
grp = (!!sym(.x) == "COLD" & !is.na(!!sym(.x))) |
!(is.na(lead(!!sym(.x))) | is.na(lag(!!sym(.x))))
) %>%
group_by(year, grp2 = cumsum(!grp)) %>%
filter(grp) %>%
count(name = .x) %>%
filter(!!sym(.x) >= 5) %>%
ungroup() %>%
select(-grp2)
)
This returns
# A tibble: 2 x 6
year k1 k2 k3 k4 k5
<dbl> <int> <int> <int> <int> <int>
1 1981 NA NA NA 5 NA
2 1981 NA NA NA 10 NA
Issues:
- It's not performing well.
- If there are several groups in different columns the output looks like this:
# A tibble: 9 x 6
year k1 k2 k3 k4 k5
<dbl> <int> <int> <int> <int> <int>
1 1981 4 NA NA NA NA
2 1981 3 NA NA NA NA
3 1981 NA 2 NA NA NA
4 1981 NA NA 1 NA NA
5 1981 NA NA NA 5 NA
6 1981 NA NA NA 10 NA
7 1981 NA NA NA NA 1
8 1981 NA NA NA NA 1
9 1981 NA NA NA NA 1
At the moment I don't know how to change it into something like
# A tibble: 2 x 6
year k1 k2 k3 k4 k5
<dbl> <int> <int> <int> <int> <int>
1 1981 4 2 1 5 1
2 1981 3 NA NA 10 1