I have a data frame in which each a row is an observation, the last column is called 'overlaps' and shows observations from a different dataset which occur at the same time as the observations in this data frame.
The results I have come from a question I previously asked about how to get overlapping data out of a data frame.
All of these overlapping observations have been concatenated together into a single column as such:
[1] "1_hands:N:1.768,1_hands:N:3.343,2_body:N:14.272"
[2] "1_hands:CH2:4.021,2_body:N:14.272"
[3] "1_hands:N:1.862,2_body:N:4.825"
[4] "1_hands:CH2:1.978,2_body:N:4.825,2_body:CH1:1.075"
[5] "1_hands:CH1:0.821,1_hands:N:1.417,1_hands:N:2.213,2_body:N:5.485"
[6] "1_hands:CH1:3.557,2_body:N:3.519"
[7] "1_hands:CH1:3.557,1_hands:N:1.249,2_body:N:3.519"
[8] "1_hands:CH1:4.896,2_body:CH1:3.308"
[9] "1_hands:CH1:4.896,2_body:CH1:3.308,2_body:N:1.67"
[10] "1_hands:CH1:4.896,2_body:N:1.67,2_body:CH1:5.288"
Each observation is separated by ",". The ":" separates different elements of the observation. For example the observation:
1_hands:N:1.768
would divided up as such:
1_hands = category
N = value
1.768 = duration
What I want to do, is get the sum total duration of each category and value, essentially, I want to add up the durations of every "1_hands:N:X".
One way to do this is with the stringr package, I can use the various str_split functions to continuously break down the observations by delimiters "," and ":", to finally get a column of just the duration values of a particular category and value, which I could then get the sum total of.
However, it's monstrously inefficient, and I have to do this for multiple data sets.
Is there an easier way to do this? Is it possible to loop through that data as such to just get the sum totals I need without breaking it down into multiple sets of data frames?
CodePudding user response:
Not sure what exact efficiency you are trying to achieve, but this solution should be reasonably fast
library(dplyr)
library(data.table)
library(stringr)
library(purrr)
df1 <- your_data[1:5,1]
df2 <- your_data[6:10,1]
myFun <- function(data){
temp <- data.table(vars = data)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed = TRUE)))] %>% na.omit()
temp <- setDT(tstrsplit(temp$vars, ":", fixed = TRUE, names = c("category", "value", "duration")))
}
dt <- list(df1, df2) %>%
purrr::map(~ myFun(.x)) %>%
rbindlist()
dt <- dt[, duration := as.numeric(duration)]
dt_sum <- dt[,.(durSum = sum(duration)), by = c("category", "value")]
CodePudding user response:
Please check below code
data
df <- data.frame(string=c("1_hands:N:1.768,1_hands:N:3.343,2_body:N:14.272",
"1_hands:N:1.768,1_hands:N:3.343,2_body:N:14.272",
"1_hands:CH2:4.021,2_body:N:14.272",
"1_hands:N:1.862,2_body:N:4.825",
"1_hands:CH2:1.978,2_body:N:4.825,2_body:CH1:1.075",
"1_hands:CH1:0.821,1_hands:N:1.417,1_hands:N:2.213,2_body:N:5.485",
"1_hands:CH1:3.557,2_body:N:3.519",
"1_hands:CH1:3.557,1_hands:N:1.249,2_body:N:3.519",
"1_hands:CH1:4.896,2_body:CH1:3.308",
"1_hands:CH1:4.896,2_body:CH1:3.308,2_body:N:1.67",
"1_hands:CH1:4.896,2_body:N:1.67,2_body:CH1:5.288"))
code
df %>%
tidyr::extract(string, into = c('category','value','duration'), regex = '(.*):(.*):(.*)') %>%
group_by(category, value) %>% summarise(duration=sum(as.numeric(duration)))
Created on 2023-01-27 with reprex v2.0.2
output
# A tibble: 5 × 3
# Groups: category [2]
category value duration
<chr> <chr> <dbl>
1 1_hands CH1 22.6
2 1_hands CH2 6.00
3 1_hands N 17.0
4 2_body CH1 13.0
5 2_body N 68.3