Calculate mean of levels interval

I have factor levels that are intervals, and I wish to calculate the mean of each interval's endpoints. Do I have to use gsub to replace characters, or is there another way?

# Reproduce data
x <- c("(-48.2,-47.8]", "(-61.9,-61.5]", "(-52.2,-51.8]", "(-43.7,-43.3]", "(-51.4,-51]", "(-43.3,-42.9]", "(-43.7,-43.3]", "(-47.4,-47]")

# My data has the form below
X <- as.factor(x)

# I want the mean of e.g X[1]
# mean(X[1]) = mean(-48.2   -47.8)

CodePudding user response：

You could also try this approach using dplyr and tidyr, which keeps both endpoints as numeric columns:

library(dplyr)
library(tidyr)


data.frame(x) %>% separate(x, into = c("num1", "num2"), sep = ",") %>%
  mutate(num1 = as.numeric(gsub("[()]|[][]", "", num1)),
         num2 = as.numeric(gsub("[()]|[][]", "", num2)),
         mean = (num1 + num2) / 2)

Output:

#    num1  num2  mean
# 1 -48.2 -47.8 -48.0
# 2 -61.9 -61.5 -61.7
# 3 -52.2 -51.8 -52.0
# 4 -43.7 -43.3 -43.5
# 5 -51.4 -51.0 -51.2
# 6 -43.3 -42.9 -43.1
# 7 -43.7 -43.3 -43.5
# 8 -47.4 -47.0 -47.2

CodePudding user response：

I think a three-step process of gsub (to remove what we don't want/need), strsplit (to separate the numbers by comma), and mean(as.numeric(.)) (to actually calculate the numeric average) should work:

gsub("[^-0-9.,]", "", x)
# [1] "-48.2,-47.8" "-61.9,-61.5" "-52.2,-51.8" "-43.7,-43.3" "-51.4,-51"   "-43.3,-42.9" "-43.7,-43.3" "-47.4,-47"  
strsplit(gsub("[^-0-9.,]", "", x), ",")
# [[1]]
# [1] "-48.2" "-47.8"
# [[2]]
# [1] "-61.9" "-61.5"
# [[3]]
# [1] "-52.2" "-51.8"
# [[4]]
# [1] "-43.7" "-43.3"
# [[5]]
# [1] "-51.4" "-51"  
# [[6]]
# [1] "-43.3" "-42.9"
# [[7]]
# [1] "-43.7" "-43.3"
# [[8]]
# [1] "-47.4" "-47"  
sapply(strsplit(gsub("[^-0-9.,]", "", x), ","), function(z) mean(as.numeric(z)))
# [1] -48.0 -61.7 -52.0 -43.5 -51.2 -43.1 -43.5 -47.2

(I should note that I'm interpreting your mean(-48.2 -47.8) to really mean mean(c(-48.2, -47.8)), since otherwise -48.2 -47.8 seems not right.)
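The three steps above can be wrapped in a small reusable function. This is only a sketch: the name interval_midpoint is hypothetical, and it assumes every label holds exactly two plain decimal numbers (labels in scientific notation such as "1e+03" would not survive the gsub step).

```r
# Hypothetical helper (not part of the answer above): midpoint of each
# interval label such as "(-48.2,-47.8]".
interval_midpoint <- function(labels) {
  # as.character() makes the factor-to-label conversion explicit
  cleaned <- gsub("[^-0-9.,]", "", as.character(labels))
  parts <- strsplit(cleaned, ",", fixed = TRUE)
  # Assumption: each label contains exactly two numbers
  stopifnot(all(lengths(parts) == 2L))
  vapply(parts, function(z) mean(as.numeric(z)), numeric(1))
}

x_demo <- c("(-48.2,-47.8]", "(-51.4,-51]")
mid_chr <- interval_midpoint(x_demo)             # character input
mid_fct <- interval_midpoint(as.factor(x_demo))  # factor input
```

Using vapply instead of sapply guarantees a numeric vector even for empty input, and the stopifnot check fails early if a label does not have the expected shape.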

CodePudding user response：

1) Assuming that what is wanted is the mean of the two numbers in each component of X, remove the first and last characters and read what is left with read.table, which creates a data frame with one row per component of X. Finally, apply rowMeans to that.

No packages are used.

rowMeans(read.table(text = sub(".(.*).", "\\1", X), sep = ","))
## [1] -48.0 -61.7 -52.0 -43.5 -51.2 -43.1 -43.5 -47.2

This can also be written as a pipeline (the _ placeholder requires R 4.2 or later):

X |> 
  sub(".(.*).", "\\1", x = _) |>
  read.table(text = _, sep = ",") |>
  rowMeans()
## [1] -48.0 -61.7 -52.0 -43.5 -51.2 -43.1 -43.5 -47.2

2) A similar approach using strapply also works. This applies the indicated function, expressed using formula notation, to the capture groups.

library(gsubfn)

strapply(format(X), "^.(.*),(.*).$", ~ mean(as.numeric(c(x, y))), simplify = TRUE)
## [1] -48.0 -61.7 -52.0 -43.5 -51.2 -43.1 -43.5 -47.2

CodePudding user response：

sapply(regmatches(x, gregexpr('[-0-9.]+', x)), \(x) mean(as.numeric(x)))

# [1] -48.0 -61.7 -52.0 -43.5 -51.2 -43.1 -43.5 -47.2
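Since the question starts from a factor, the same regmatches idea can be applied to levels(X) only, and the result mapped back through the factor's integer codes. This is a sketch, not part of the answer above; the names lev, level_mid and mids are made up for illustration. It parses each distinct interval once, which may help when X is long but has few levels.

```r
x <- c("(-48.2,-47.8]", "(-61.9,-61.5]", "(-52.2,-51.8]", "(-43.7,-43.3]",
       "(-51.4,-51]", "(-43.3,-42.9]", "(-43.7,-43.3]", "(-47.4,-47]")
X <- as.factor(x)

lev <- levels(X)

# One midpoint per distinct level
level_mid <- sapply(regmatches(lev, gregexpr("[-0-9.]+", lev)),
                    function(z) mean(as.numeric(z)))

# A factor stores integer codes that index into levels(X);
# use them to look up the midpoint for every element of X
mids <- level_mid[as.integer(X)]
```

Here mids has one value per element of X, in the original order, while level_mid has one value per level.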