sum function in R data frame not capturing correct data

I have a dataframe that looks like this:

   channel      start.time stop.time  vp  duration X  id  overlaps
1: 4_speech     14.183     16.554     CH1 2.371    NA 165 1_hands_CH1_1.145;2_body_CH1_1.883;2_body_N_14.272;2_body_N_4.825
2: 4_speech     21.196     22.259     CH1 1.063    NA 166 1_hands_N_1.417;2_body_CH1_1.075;2_body_N_5.485
3: 4_speech     28.001     31.518     CH1 3.517    NA 167 1_hands_CH1_3.557;1_hands_CH2_1.75;2_body_CH1_3.445;2_body_N_3.519
4: 4_speech     34.867     36.549     CH1 1.682    NA 168 2_body_CH1_3.308
5: 4_speech     41.019     42.265     CH1 1.246    NA 169 1_hands_CH1_4.896;1_hands_N_0.663;2_body_CH1_5.288
6: 4_speech     55.262     57.800     CH1 2.538    NA 170 2_body_CH1_2.494;2_body_N_6.571

The first seven columns describe a particular observation. The eighth column, 'overlaps', lists observations from a different data frame that co-occur with the observation in that row. Each entry in the overlaps column is structured like this: 'channel_vp_duration'. For example, in the first entry of 'overlaps' in row 1, '1_hands' is the channel, 'CH1' is the vp (a kind of value), and 1.145 is the duration of that observation.
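To make the 'channel_vp_duration' format concrete, here is a minimal base R sketch that splits the row 1 overlaps value into its three fields. The names `entries`, `parts` and `parsed` are only illustrative, and the sketch assumes that vp never contains an underscore and that the duration is always the last field.

```r
# The 'overlaps' value from row 1 of the data above
overlaps <- "1_hands_CH1_1.145;2_body_CH1_1.883;2_body_N_14.272;2_body_N_4.825"

# Split the cell into its individual entries
entries <- strsplit(overlaps, ";", fixed = TRUE)[[1]]

# The channel itself contains an underscore ("1_hands"), so capture the
# last two fields explicitly and treat everything before them as the channel
# (assumption: vp has no underscore, duration is the final field)
parts <- regmatches(entries, regexec("^(.*)_([^_]+)_([0-9.]+)$", entries))

parsed <- data.frame(
  channel  = sapply(parts, `[`, 2),
  vp       = sapply(parts, `[`, 3),
  duration = as.numeric(sapply(parts, `[`, 4)),
  stringsAsFactors = FALSE
)
parsed

# Total duration of the 1_hands entries in this row
sum(parsed$duration[parsed$channel == "1_hands"])
```

For row 1 the only '1_hands' entry is 1_hands_CH1_1.145, so the last line gives the 1.145 that I would expect to see in the result.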

I want a sum of all the durations for a given observation type. I can partly get this with the following code, adapted from an answer to a question I asked previously about how to get the overlaps data in the first place.

library(data.table)
library(stringr)
setDT(speech_rows)
speech_rows[, id := .I]
setkey(speech_rows, id)
#self join on subset by row
speech_rows[speech_rows, durs := { 
  temp <- df[!id == i.id & start.time < i.stop.time & stop.time > i.start.time & channel == "1_hands", ]
  sum(temp$duration)
  }, by = .EACHI]

This adds another column, 'durs', which is supposed to show the total duration of all the numeric values attached to a '1_hands' string in the overlaps column. It produces the following (first seven columns removed to save space):

 overlaps                                                              durs
1: 1_hands_CH1_1.145;2_body_CH1_1.883;2_body_N_14.272;2_body_N_4.825   0.000
2: 1_hands_N_1.417;2_body_CH1_1.075;2_body_N_5.485                     1.417
3: 1_hands_CH1_3.557;1_hands_CH2_1.75;2_body_CH1_3.445;2_body_N_3.519  1.750
4: 2_body_CH1_3.308                                                    0.000
5: 1_hands_CH1_4.896;1_hands_N_0.663;2_body_CH1_5.288                  5.559
6: 2_body_CH1_2.494;2_body_N_6.571                                     0.000

But there is a problem: the sum does not capture all of the relevant strings. In row 1, "1_hands_CH1_1.145" is the only '1_hands' string, so the value under durs for row 1 should be 1.145, but the code ignores it for some reason. In row 2, the durs sum is correct. In row 3, it counts only one of the 1_hands values (1.75) and ignores the other (3.557). In row 5, it correctly finds both of the 1_hands values and adds them together. Rows 4 and 6 have correct durs values because there are no 1_hands observations in them.

This is very strange, and I don't know why it correctly detects the numeric values at some times but not at others. This is problem #1.

Problem #2: I cannot specify anything beyond '1_hands'. What I really want is the sum of durations for all 1_hands_CH1 values, NOT all 1_hands values. To do this, I assumed I would just need to change the string in 'channel == "1_hands"':

 temp <- df[!id == i.id & start.time < i.stop.time & stop.time > i.start.time & channel == **"1_hands"**, ]

But if I change it to something like "1_hands_CH1", all of the durs values become zero; it can't match anything past '1_hands'.

So in sum, I want to know why the math isn't working like I want it to, and why I can't select more specific strings.

CodePudding user response：

Here is one way you could get durations out of your overlaps column using the tidyverse. You can set text_string equal to what you want durations for. I have provided some examples of how to enter your text string. The example below returns durations for all "1_hands" observations. If you wanted durations just for the "1_hands_CH1", then you would just set text_string <- "1_hands_CH1".

# Load tidyverse
library(tidyverse)

# Set text_string Equal To Specific String You Want Durations For
text_string <- "1_hands_[A-Z0-9]+"

# Examples For text_string
# text_string <- "1_hands_CH1" ## example for getting 1_hands_CH1
# text_string <- "2_body_N" ## example for getting 2_body_N
# text_string <- "1_hands_[A-Z0-9]+" ## example for getting all 1_hands
# text_string <- "2_body_[A-Z0-9]+" ## example for getting all 2_body

# df With Durations
df_with_durs <- df %>%
    as_tibble() %>%
    mutate(str_matches = str_match_all(overlaps, str_glue("{text_string}_[0-9.]+")),
           durs = map(str_matches,
                          function(x) {
                              durs <- str_remove(x, str_glue("{text_string}_"))
                              num_durs <- as.numeric(durs)
                              sum_durs <- sum(num_durs) 
                              return(sum_durs)
                              }
                          )
           ) %>% 
    unnest(cols = durs) %>%
    select(-str_matches)

# View Output
df_with_durs

# channel  start.time stop.time vp    duration X        id overlaps                                                            durs
# <chr>         <dbl>     <dbl> <chr>    <dbl> <lgl> <int> <chr>                                                              <dbl>
# 4_speech       14.2      16.6 CH1       2.37 NA      165 1_hands_CH1_1.145;2_body_CH1_1.883;2_body_N_14.272;2_body_N_4.825   1.14
# 4_speech       21.2      22.3 CH1       1.06 NA      166 1_hands_N_1.417;2_body_CH1_1.075;2_body_N_5.485                     1.42
# 4_speech       28.0      31.5 CH1       3.52 NA      167 1_hands_CH1_3.557;1_hands_CH2_1.75;2_body_CH1_3.445;2_body_N_3.519  5.31
# 4_speech       34.9      36.5 CH1       1.68 NA      168 2_body_CH1_3.308                                                    0   
# 4_speech       41.0      42.3 CH1       1.25 NA      169 1_hands_CH1_4.896;1_hands_N_0.663;2_body_CH1_5.288                  5.56
# 4_speech       55.3      57.8 CH1       2.54 NA      170 2_body_CH1_2.494;2_body_N_6.571                                     0
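
If you would rather avoid the tidyverse dependency, the same idea can be sketched in base R. This is only an illustration, not a drop-in replacement: `sum_overlap_durs` is a hypothetical helper name, and it assumes every overlaps entry ends in "_<duration>" and that entries are separated by ";".

```r
# Sum the durations of all overlap entries whose label matches `pattern`.
# `pattern` is a regular expression for the "channel_vp" part of an entry.
sum_overlap_durs <- function(overlaps, pattern) {
  # Anchor the pattern so "1_hands_CH1" cannot match a longer label
  full_pattern <- paste0("^", pattern, "_[0-9.]+$")
  vapply(strsplit(overlaps, ";", fixed = TRUE), function(entries) {
    hits <- entries[grepl(full_pattern, entries)]
    # Keep only the text after the last underscore, i.e. the duration;
    # sum() of an empty vector is 0, so rows without a match give 0
    sum(as.numeric(sub("^.*_", "", hits)))
  }, numeric(1))
}

# First four 'overlaps' values from the question
overlaps <- c(
  "1_hands_CH1_1.145;2_body_CH1_1.883;2_body_N_14.272;2_body_N_4.825",
  "1_hands_N_1.417;2_body_CH1_1.075;2_body_N_5.485",
  "1_hands_CH1_3.557;1_hands_CH2_1.75;2_body_CH1_3.445;2_body_N_3.519",
  "2_body_CH1_3.308"
)

all_hands <- sum_overlap_durs(overlaps, "1_hands_[A-Z0-9]+")  # every 1_hands entry
ch1_hands <- sum_overlap_durs(overlaps, "1_hands_CH1")        # only 1_hands_CH1
all_hands
ch1_hands
```

For these four rows, `all_hands` comes out as 1.145, 1.417, 5.307 and 0, in line with the durs column above, while `ch1_hands` keeps only the CH1 entries (1.145, 0, 3.557 and 0). With your data you could assign the result directly, for example `df$durs <- sum_overlap_durs(df$overlaps, "1_hands_CH1")`.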