Home > Software engineering >  Streamlining code to more efficiently get the mean of two variables based on the binning of another
Streamlining code to more efficiently get the mean of two variables based on the binning of another

Time:10-27

I am relatively new to r and have been racking my head around trying to figure out how to make my code more efficient. This is a subset of the data I have been using:

mydata <- structure(list(Subject = c("1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2"), Interval = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 
21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 1L, 2L, 3L, 
4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 
18L, 19L, 20L), XDistance = c(240, 252.5, 125, 107.5, 170, 77.5, 
105, 157.5, 187.5, 125, 62.5, 187.5, 15, 130, 45, 0, 80, 205, 
97.5, 85, 160, 152.5, 12.5, 107.5, 157.5, 112.5, 102.5, 82.5, 
55, 57.5, 217.5, 235, 142.5, 215, 127.5, 120, 115, 167.5, 182.5, 
147.5, 207.5, 90, 165, 155, 222.5, 140, 175, 72.5, 112.5, 172.5
), YDistance = c(235, 190, 145, 132.5, 210, 92.5, 160, 150, 192.5, 
170, 105, 162.5, 87.5, 170, 80, 12.5, 145, 182.5, 170, 87.5, 
102.5, 122.5, 0, 117.5, 247.5, 195, 145, 167.5, 97.5, 75, 395, 
277.5, 245, 260, 270, 237.5, 235, 275, 245, 210, 200, 92.5, 217.5, 
195, 225, 247.5, 212.5, 135, 187.5, 192.5)), row.names = c(NA, 
-50L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x0000022000ea9850>)e here

My aim is to get the means of XDistance & YDistance for each subject in 5 interval bins. There are a total of 30 intervals for each subject and I would 6 groups of means per subject. I would then like to create a new table of these means. I was able to do this, however, I did it in a convoluted manner. Here is the code I have used:

mydata <- read.csv("DSR_Practice.csv")
par(mfrow = c(2,2))
library(dplyr)
mydata <- mydata[!(mydata$Subject == "Not Used"), ]

library(data.table)
setDT(mydata)
cuts <- list(c(1,5), c(6,10), c(11,15), c(16, 20), c(21, 25), c(26, 30))
data <- lapply(X = cuts, function(i) {
  mydata[between(x = mydata[ , Interval], lower = i[1], upper = i[2])]
})
Bin1 <- as.data.frame(data[[1]])
Bin1 <- Bin1 %>%
  group_by(Subject) %>%
  summarise(across(XDistance:YDistance, mean, X = "{XDistance}.{mean}", Y = "{YDistance}. {mean}")) %>%
  as.data.frame()
Bin2 <- as.data.frame(data[[2]])
Bin2 <- Bin2 %>%
  group_by(Subject) %>%
  summarise(across(XDistance:YDistance, mean, X = "{XDistance}.{mean}", Y = "{YDistance}. {mean}")) %>%
  as.data.frame()
Bin3 <- as.data.frame(data[[3]])
Bin3 <- Bin3 %>%
  group_by(Subject) %>%
  summarise(across(XDistance:YDistance, mean, X = "{XDistance}.{mean}", Y = "{YDistance}. {mean}")) %>%
  as.data.frame()
Bin4 <- as.data.frame(data[[4]])
Bin4 <- Bin4 %>%
  group_by(Subject) %>%
  summarise(across(XDistance:YDistance, mean, X = "{XDistance}.{mean}", Y = "{YDistance}. {mean}")) %>%
  as.data.frame()
Bin5 <- as.data.frame(data[[5]])
Bin5 <- Bin5 %>%
  group_by(Subject) %>%
  summarise(across(XDistance:YDistance, mean, X = "{XDistance}.{mean}", Y = "{YDistance}. {mean}")) %>%
  as.data.frame()
Bin6 <- as.data.frame(data[[6]])
Bin6 <- Bin6 %>%
  group_by(Subject) %>%
  summarise(across(XDistance:YDistance, mean, X = "{XDistance}.{mean}", Y = "{YDistance}. {mean}")) %>%
  as.data.frame()


df.list <- list(Bin1, Bin2, Bin3, Bin4, Bin5, Bin6)
df.list <- Reduce(function(x, y) merge(x, y, all = TRUE), df.list)

Any help would be greatly appreciated!

I was trying to use a for loop create these bins and also tried to use lapply to get the means of a list of all bins but I can't seem to figure it out.

CodePudding user response:

You could create a variable called bin using the function cut and do groupwise calculations using this:

mydata %>%
  mutate(bin = as.numeric(cut(Interval, seq(0, 30, 5)   0.5))) %>%
  group_by(Subject, bin) %>%
  summarise_all(mean) %>%
  select(-Interval)
#> # A tibble: 10 x 4
#> # Groups:   Subject [2]
#>    Subject   bin XDistance YDistance
#>    <chr>   <dbl>     <dbl>     <dbl>
#>  1 1           1     179        182.
#>  2 1           2     130.       153 
#>  3 1           3      88        121 
#>  4 1           4      93.5      120.
#>  5 1           5     118        118 
#>  6 1           6      82        136 
#>  7 2           1     188.       290.
#>  8 2           2     146.       240.
#>  9 2           3     168        186 
#> 10 2           4     134.       195

Created on 2022-10-26 with reprex v2.0.2

CodePudding user response:

I think this dplyr approach produces what you're looking for, although there may be a neater way to create your bins than the case_when series I used.

library(dplyr)
mydata %>%
  group_by(Subject) %>%
  mutate(Bin = case_when(
    Interval >= 1 & Interval <= 5 ~ 1,
    Interval >=6 & Interval <=10 ~ 2,
    Interval >= 11 & Interval <=15 ~ 3,
    Interval >=16 & Interval <=20 ~ 4,
    Interval >=21 & Interval <=25 ~ 5,
    Interval >=26 & Interval <=30 ~ 6,
    TRUE ~ 0
  )) %>%
  group_by(Subject, Bin) %>%
  summarize(XAvg = mean(XDistance),
            YAvg = mean(YDistance))
  •  Tags:  
  • r
  • Related