Home > Software engineering >  R: simpler ways of splitting columns and getting sums of category values
R: simpler ways of splitting columns and getting sums of category values

Time:01-28

I have a data frame in which each a row is an observation, the last column is called 'overlaps' and shows observations from a different dataset which occur at the same time as the observations in this data frame.

The results I have come from a question I previously asked about how to get overlapping data out of a data frame.

All of these overlapping observations have been concatenated together into a single column as such:

 [1] "1_hands:N:1.768,1_hands:N:3.343,2_body:N:14.272"                                                                                                                                  
 [2] "1_hands:CH2:4.021,2_body:N:14.272"                                                                                                                                                
 [3] "1_hands:N:1.862,2_body:N:4.825"                                                                                                                                                   
 [4] "1_hands:CH2:1.978,2_body:N:4.825,2_body:CH1:1.075"                                                                                                                                
 [5] "1_hands:CH1:0.821,1_hands:N:1.417,1_hands:N:2.213,2_body:N:5.485"                                                                                                                 
 [6] "1_hands:CH1:3.557,2_body:N:3.519"                                                                                                                                                 
 [7] "1_hands:CH1:3.557,1_hands:N:1.249,2_body:N:3.519"                                                                                                                                 
 [8] "1_hands:CH1:4.896,2_body:CH1:3.308"                                                                                                                                               
 [9] "1_hands:CH1:4.896,2_body:CH1:3.308,2_body:N:1.67"                                                                                                                                 
[10] "1_hands:CH1:4.896,2_body:N:1.67,2_body:CH1:5.288"

Each observation is separated by ",". The ":" separates different elements of the observation. For example the observation:

1_hands:N:1.768

would divided up as such:

1_hands = category

N = value

1.768 = duration

What I want to do, is get the sum total duration of each category and value, essentially, I want to add up the durations of every "1_hands:N:X".

One way to do this is with the stringr package, I can use the various str_split functions to continuously break down the observations by delimiters "," and ":", to finally get a column of just the duration values of a particular category and value, which I could then get the sum total of.

However, it's monstrously inefficient, and I have to do this for multiple data sets.

Is there an easier way to do this? Is it possible to loop through that data as such to just get the sum totals I need without breaking it down into multiple sets of data frames?

CodePudding user response:

Not sure what exact efficiency you are trying to achieve, but this solution should be reasonably fast

library(dplyr)
library(data.table)
library(stringr)
library(purrr)

df1 <- your_data[1:5,1]
df2 <- your_data[6:10,1]

myFun <- function(data){
  temp <- data.table(vars = data)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed = TRUE)))] %>% na.omit()
  temp <- setDT(tstrsplit(temp$vars, ":", fixed = TRUE, names = c("category", "value", "duration")))
}

dt <- list(df1, df2) %>%
  purrr::map(~ myFun(.x)) %>%
  rbindlist()
dt <- dt[, duration := as.numeric(duration)]

dt_sum <- dt[,.(durSum = sum(duration)), by = c("category", "value")]

CodePudding user response:

Please check below code

data

df <- data.frame(string=c("1_hands:N:1.768,1_hands:N:3.343,2_body:N:14.272",
                          "1_hands:N:1.768,1_hands:N:3.343,2_body:N:14.272",                                                                                                                                  
                          "1_hands:CH2:4.021,2_body:N:14.272",                                                                                                                                                
                          "1_hands:N:1.862,2_body:N:4.825",                                                                                                                                                   
                          "1_hands:CH2:1.978,2_body:N:4.825,2_body:CH1:1.075",                                                                                                                                
                          "1_hands:CH1:0.821,1_hands:N:1.417,1_hands:N:2.213,2_body:N:5.485",                                                                                                                 
                          "1_hands:CH1:3.557,2_body:N:3.519",                                                                                                                                                 
                          "1_hands:CH1:3.557,1_hands:N:1.249,2_body:N:3.519",                                                                                                                                 
                          "1_hands:CH1:4.896,2_body:CH1:3.308",                                                                                                                                               
                          "1_hands:CH1:4.896,2_body:CH1:3.308,2_body:N:1.67",                                                                                                                                 
                          "1_hands:CH1:4.896,2_body:N:1.67,2_body:CH1:5.288"))

code

df %>% 
  tidyr::extract(string, into = c('category','value','duration'), regex = '(.*):(.*):(.*)') %>% 
  group_by(category, value) %>% summarise(duration=sum(as.numeric(duration)))

Created on 2023-01-27 with reprex v2.0.2

output

# A tibble: 5 × 3
# Groups:   category [2]
  category value duration
  <chr>    <chr>    <dbl>
1 1_hands  CH1      22.6 
2 1_hands  CH2       6.00
3 1_hands  N        17.0 
4 2_body   CH1      13.0 
5 2_body   N        68.3 

  • Related