In R, get average of decimals in a string marked by % sign-CodePudding

For a list of products, I need to get the simple average of the percentage values in a column of strings rates and save them in a new column average. How do I do this for example with the below dataframe of sample entries?

rates <-c("7.5% ( 01-01 to 31-01 ) ; 0% ( 01-02 to 30-06 ) ; 7.5% ( 01-07 to 31-12 )", "0% (01 FEB - 30 JUN); 3.70% (01-01 to 31-01) ; 3.70% (01-07 to 31-12)")

product <-c("001","002")

df <- data.frame(product,rates)

#this needs to get the simple average of all the percentage values in the string 
df$average <- average(as.numeric(grepl("\\d \\%",rates)))

So the result would be:

df$average <- c(5,2.5)
df

CodePudding user response：

First, let's fix the data to have a row per sample:

rates <-c("7.5% ( 01-01 to 31-01 ) ; 0% ( 01-02 to 30-06 ) ; 7.5% ( 01-07 to 31-12 )", "0% (01 FEB - 30 JUN); 3.70% (01-01 to 31-01) ; 3.70% (01-07 to 31-12)")
# Unlist the two values in the list, collapse everything with ';' to keep the pattern,
# and apply strplit() to separate by ';'
rates <- strsplit(paste0(unlist(rates),collapse=';'),';')

# Reply 3 times
product <-c("001","002")
product = sort(rep(product,3))

df <- data.frame(product,rates)
names(df) = c('product','rates')
df

Which now:

  product                    rates
1     001 7.5% ( 01-01 to 31-01 ) 
2     001   0% ( 01-02 to 30-06 ) 
3     001  7.5% ( 01-07 to 31-12 )
4     002     0% (01 FEB - 30 JUN)
5     002  3.70% (01-01 to 31-01)

Now, let's extract the percentages from rates column

# Replace everything after the % with nothing using gsub
df$percentage = as.double(gsub("%.*",'',df$rates))
df

  product                    rates percentage
1     001 7.5% ( 01-01 to 31-01 )         7.5
2     001   0% ( 01-02 to 30-06 )         0.0
3     001  7.5% ( 01-07 to 31-12 )        7.5
4     002     0% (01 FEB - 30 JUN)        0.0
5     002  3.70% (01-01 to 31-01)         3.7
6     002   3.70% (01-07 to 31-12)        3.7
6     002   3.70% (01-07 to 31-12)

Finally, the average by product:

average = aggregate(df$percentage, list(df$product), FUN=mean)
average

  Group.1        x
1     001 5.000000
2     002 2.466667

CodePudding user response：

You could make a small function ,f that extracts the mean (using stringr::str_extract_all(), and apply f to rates, rowwise:

library(dplyr)
library(stringr)

f <- function(s) mean(as.numeric(str_extract_all(s, "[0-9.]*(?=%)")[[1]]), na.rm=T)
df %>% rowwise() %>% mutate(average = f(rates))

Output:

  product rates                                                                     average
  <chr>   <chr>                                                                       <dbl>
1 001     7.5% ( 01-01 to 31-01 ) ; 0% ( 01-02 to 30-06 ) ; 7.5% ( 01-07 to 31-12 )    5   
2 002     0% (01 FEB - 30 JUN); 3.70% (01-01 to 31-01) ; 3.70% (01-07 to 31-12)        2.47

CodePudding user response：

In base R we can use regmatches/regexpr to extract the numeric part before the % and get the mean after converting to numeric

df$average <- sapply(regmatches(df$rates, gregexpr("[0-9.] (?=%)",
   df$rates, perl = TRUE)), function(x) mean(as.numeric(x)))
df$average
[1] 5.000000 2.466667

CodePudding user response：

using base R:

vasl <- gsub("%[^;] ", '',df$rates)
df$average <- rowMeans(read.table(text=vals, sep=';',fill= TRUE), na.rm = TRUE)

df$average
[1] 5.000000 2.466667

using tidyverse:

df %>%
  mutate(r = str_remove_all(rates,'%[^;] '))%>%
  separate_rows(r, convert = TRUE)%>%
  group_by(product, rates)%>%
  summarise(av = mean(r), .groups = 'drop')


     product rates                                                                        av
  <chr>   <chr>                                                                     <dbl>
1 001     7.5% ( 01-01 to 31-01 ) ; 0% ( 01-02 to 30-06 ) ; 7.5% ( 01-07 to 31-12 )  5   
2 002     0% (01 FEB - 30 JUN); 3.70% (01-01 to 31-01) ; 3.70% (01-07 to 31-12)      2.47