How do I gsub the complete time string behind @

(This is my first question; if I need to improve anything about it, please let me know!)

I am analysing a large observational dataset. The start and stop times of each observation are recorded, so I was able to calculate the duration. But there is a notes column that includes information on "pauses" / "breaks" or "out of sight" periods in which the animal was not seen. I would like to subtract those time periods from the total duration.

My problem is that this one column includes several kinds of notes: not only pauses ("HH:MM-HH:MM") but also info on certain events (xy happened "@HH:MM").

I only want to look at time periods in the format HH:MM-HH:MM, and I want to exclude all event times labeled "@HH:MM". I've managed to drop all words and be left with only numbers, so it looks like this:

id <- c("3990", "3989", "3004")

timepoints <- c("@6:19,,7:16-7:23,7:25-7:43,@7:53,", "@6:19,,7:25-7:43,@7:53", "7:30-7:39,7:45-7:48,7:49-7:54")

df <- data.frame(id, timepoints)

I tried several ways with grep or gsub, indicating either what to keep or what to leave out, but I failed. The closest I got was R dropping "@HH" but keeping ":MM". For this I used

gsub("@([[:digit:]]|[_])*", "", df$timepoints)

as found for a similar problem, just with words, here: remove all words that start with "@" from a string

The aim is to get (e.g.):

id	timepoints
3990	"7:16-7:23, 7:25-7:43"

or

id	timepoints
3990	"7:16-7:23", "7:25-7:43"

If possible separated by comma, or directly separated into different columns so i can extract the time and subtract it from my total observation time.

Any help would be greatly appreciated!

CodePudding user response：

You can do something like this:

f <- function(x) {
  lapply(x, \(s) {
    s = strsplit(s,",")[[1]]
    s[grepl("^\\d",s)]
  })
}

and then apply that function to the timepoints column

library(tidyverse)
mutate(df %>% as_tibble(), timepoints = f(timepoints)) %>% 
  unnest(timepoints)

Output:

  id    timepoints
  <chr> <chr>     
1 3990  7:16-7:23 
2 3990  7:25-7:43 
3 3989  7:25-7:43 
4 3004  7:30-7:39 
5 3004  7:45-7:48 
6 3004  7:49-7:54

You could also use unnest_wider() to get these as columns; for that I would adjust my f() to include the names of the timepoints:

f <- function(x) {
  lapply(x, \(s) {
    s = strsplit(s,",")[[1]]
    s = s[grepl("^\\d",s)]
    # seq_along() is safe when a row has no pauses (1:length(s) would give 1:0)
    setNames(s, paste0("tp", seq_along(s)))
  })
}

library(tidyverse)
mutate(df %>% as_tibble(), timepoints = f(timepoints)) %>% 
  unnest_wider(timepoints)

Output:

  id    tp1       tp2       tp3      
  <chr> <chr>     <chr>     <chr>    
1 3990  7:16-7:23 7:25-7:43 NA       
2 3989  7:25-7:43 NA        NA       
3 3004  7:30-7:39 7:45-7:48 7:49-7:54

CodePudding user response：

How about matching the strings you're interested in instead?

With base:

df$new_timepoints <- regmatches(df$timepoints, gregexpr("\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}", df$timepoints))

Output (with a list column):

    id                        timepoints                  new_timepoints
1 3990 @6:19,,7:16-7:23,7:25-7:43,@7:53,            7:16-7:23, 7:25-7:43
2 3989            @6:19,,7:25-7:43,@7:53                       7:25-7:43
3 3004     7:30-7:39,7:45-7:48,7:49-7:54 7:30-7:39, 7:45-7:48, 7:49-7:54
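If you would rather stay with gsub as in the question: the attempt there stops at the colon because its character class only covers digits and underscores, so ":MM" is left behind. Below is a minimal sketch of a pattern that also consumes the colon and minutes, followed by a comma clean-up. It assumes event times always look like "@H:MM" or "@HH:MM"; the names no_events and cleaned are just for illustration.

```r
timepoints <- c("@6:19,,7:16-7:23,7:25-7:43,@7:53,",
                "@6:19,,7:25-7:43,@7:53",
                "7:30-7:39,7:45-7:48,7:49-7:54")

# Drop every "@H:MM" or "@HH:MM" token, colon and minutes included
no_events <- gsub("@\\d{1,2}:\\d{2}", "", timepoints)

# Collapse runs of commas, then trim a leading or trailing comma
cleaned <- gsub("^,|,$", "", gsub(",+", ",", no_events))
cleaned
```

This leaves one comma-separated string per row, which can then be split with strsplit() if needed.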

With tidyverse (in a long format for easy calculations!):

library(stringr)
library(dplyr)
library(tidyr)

df |>
  group_by(id) |>
  mutate(new_timepoints = str_extract_all(timepoints, "\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}")) |>
  unnest_longer(new_timepoints) |>
  ungroup()

Output:

# A tibble: 6 × 3
  id    timepoints                        new_timepoints
  <chr> <chr>                             <chr>         
1 3990  @6:19,,7:16-7:23,7:25-7:43,@7:53, 7:16-7:23     
2 3990  @6:19,,7:16-7:23,7:25-7:43,@7:53, 7:25-7:43     
3 3989  @6:19,,7:25-7:43,@7:53            7:25-7:43     
4 3004  7:30-7:39,7:45-7:48,7:49-7:54     7:30-7:39     
5 3004  7:30-7:39,7:45-7:48,7:49-7:54     7:45-7:48     
6 3004  7:30-7:39,7:45-7:48,7:49-7:54     7:49-7:54
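From the long format, the pause lengths can be computed and summed per id. The following is only a sketch in base R: to_minutes() is a hypothetical helper, the data frame long is typed in by hand to mirror the output above, and it assumes no pause crosses midnight.

```r
# Hypothetical helper: convert "H:MM" strings to minutes since midnight
to_minutes <- function(hm) {
  parts <- strsplit(hm, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

# Long-format data, copied by hand from the output above
long <- data.frame(
  id = c("3990", "3990", "3989", "3004", "3004", "3004"),
  new_timepoints = c("7:16-7:23", "7:25-7:43", "7:25-7:43",
                     "7:30-7:39", "7:45-7:48", "7:49-7:54"),
  stringsAsFactors = FALSE
)

# Split each range into its start (column 1) and end (column 2)
bounds <- do.call(rbind, strsplit(long$new_timepoints, "-", fixed = TRUE))
long$pause_min <- to_minutes(bounds[, 2]) - to_minutes(bounds[, 1])

# Total pause minutes per id, ready to subtract from the observation duration
pause_total <- tapply(long$pause_min, long$id, sum)
pause_total
```

The per-id totals can then be matched back to the original data by id and subtracted from the total duration.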

CodePudding user response：

Setting the data with the package data.table

library(data.table)
id <- c("3990", "3989", "3004")

timepoints <- c("@6:19,,7:16-7:23,7:25-7:43,@7:53,", "@6:19,,7:25-7:43,@7:53", "7:30-7:39,7:45-7:48,7:49-7:54")

df <- data.table(id, timepoints)

Note that I saved it as a data.table.

Splitting the timepoints by comma and storing the value in the new_time column.

df[,new_time:=strsplit(timepoints, ",")]

Removing the string values that contain @

df[,new_time:=sapply(new_time, function(x) return(x[!grepl("[@]", x)]))]

Since the timepoints column has multiple commas in a row, the split leaves empty strings (""), so I remove them

df[,new_time:=sapply(new_time, function(x) return(x[!stringi::stri_isempty(x)]))]

Now the new_time column looks like this

df$new_time
[[1]]
[1] "7:16-7:23" "7:25-7:43"

[[2]]
[1] "7:25-7:43"

[[3]]
[1] "7:30-7:39" "7:45-7:48" "7:49-7:54"

If you want the new_time column to hold a single string per row

df[,new_time:=sapply(new_time, paste, collapse=", ")]
df$new_time
[1] "7:16-7:23, 7:25-7:43"            "7:25-7:43"                       "7:30-7:39, 7:45-7:48, 7:49-7:54"

CodePudding user response：

1) list: Split by comma and then grep out the components with a dash. No packages are used. This gives a list of character vectors as the timepoints column.

df2 <- df
df2$timepoints <- lapply(strsplit(df$timepoints, ","), 
  grep, pattern = "-", value = TRUE)

df2
##     id                      timepoints
## 1 3990            7:16-7:23, 7:25-7:43
## 2 3989                       7:25-7:43
## 3 3004 7:30-7:39, 7:45-7:48, 7:49-7:54

str(df2)
'data.frame':   3 obs. of  2 variables:
 $ id        : chr  "3990" "3989" "3004"
 $ timepoints:List of 3
  ..$ : chr  "7:16-7:23" "7:25-7:43"
  ..$ : chr "7:25-7:43"
  ..$ : chr  "7:30-7:39" "7:45-7:48" "7:49-7:54"
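Filtering on "-" keeps any component that contains a dash. If the real notes might contain other dashed fragments, a stricter, anchored pattern could be used instead. This is only a sketch; the token "n-a" below is made up to show the difference.

```r
# Example tokens after splitting by comma; "n-a" is a made-up stray fragment
tokens <- c("@6:19", "", "7:16-7:23", "n-a", "7:25-7:43")

# Loose filter: anything containing a dash
loose <- grep("-", tokens, value = TRUE)

# Strict filter: only complete H:MM-H:MM (or HH:MM-HH:MM) ranges
strict <- grep("^\\d{1,2}:\\d{2}-\\d{1,2}:\\d{2}$", tokens, value = TRUE)

loose
strict
```

For the sample data in the question both filters give the same result, so this only matters if the column is messier than shown.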

2) character: If you want a comma-separated character string in each row, add this:

transform(df2, timepoints = sapply(timepoints, paste, collapse = ","))
##     id                    timepoints
## 1 3990           7:16-7:23,7:25-7:43
## 2 3989                     7:25-7:43
## 3 3004 7:30-7:39,7:45-7:48,7:49-7:54

3) long form: Or, if you prefer long form, use this:

long <- with(df2, stack(setNames(timepoints, id))[2:1])
names(long) <- names(df2)
long
##     id timepoints
## 1 3990  7:16-7:23
## 2 3990  7:25-7:43
## 3 3989  7:25-7:43
## 4 3004  7:30-7:39
## 5 3004  7:45-7:48
## 6 3004  7:49-7:54

4) wide form: Or a wide-form matrix:

nr <- nrow(long)
L <- transform(long, seq = ave(1:nr, id, FUN = seq_along))
tapply(L$timepoints, L[c("id", "seq")], c)
##       seq
## id     1           2           3          
##   3990 "7:16-7:23" "7:25-7:43" NA         
##   3989 "7:25-7:43" NA          NA         
##   3004 "7:30-7:39" "7:45-7:48" "7:49-7:54"