For a list of products
, I need to get the simple average of the percentage values in a column of strings rates
and save them in a new column average
. How do I do this for example with the below dataframe of sample entries?
rates <-c("7.5% ( 01-01 to 31-01 ) ; 0% ( 01-02 to 30-06 ) ; 7.5% ( 01-07 to 31-12 )", "0% (01 FEB - 30 JUN); 3.70% (01-01 to 31-01) ; 3.70% (01-07 to 31-12)")
product <-c("001","002")
df <- data.frame(product,rates)
#this needs to get the simple average of all the percentage values in the string
df$average <- average(as.numeric(grepl("\\d \\%",rates)))
So the result would be:
df$average <- c(5,2.5)
df
CodePudding user response:
First, let's fix the data to have a row per sample:
rates <-c("7.5% ( 01-01 to 31-01 ) ; 0% ( 01-02 to 30-06 ) ; 7.5% ( 01-07 to 31-12 )", "0% (01 FEB - 30 JUN); 3.70% (01-01 to 31-01) ; 3.70% (01-07 to 31-12)")
# Unlist the two values in the list, collapse everything with ';' to keep the pattern,
# and apply strplit() to separate by ';'
rates <- strsplit(paste0(unlist(rates),collapse=';'),';')
# Reply 3 times
product <-c("001","002")
product = sort(rep(product,3))
df <- data.frame(product,rates)
names(df) = c('product','rates')
df
Which now:
product rates
1 001 7.5% ( 01-01 to 31-01 )
2 001 0% ( 01-02 to 30-06 )
3 001 7.5% ( 01-07 to 31-12 )
4 002 0% (01 FEB - 30 JUN)
5 002 3.70% (01-01 to 31-01)
Now, let's extract the percentages from rates
column
# Replace everything after the % with nothing using gsub
df$percentage = as.double(gsub("%.*",'',df$rates))
df
product rates percentage
1 001 7.5% ( 01-01 to 31-01 ) 7.5
2 001 0% ( 01-02 to 30-06 ) 0.0
3 001 7.5% ( 01-07 to 31-12 ) 7.5
4 002 0% (01 FEB - 30 JUN) 0.0
5 002 3.70% (01-01 to 31-01) 3.7
6 002 3.70% (01-07 to 31-12) 3.7
6 002 3.70% (01-07 to 31-12)
Finally, the average by product:
average = aggregate(df$percentage, list(df$product), FUN=mean)
average
Group.1 x
1 001 5.000000
2 002 2.466667
CodePudding user response:
You could make a small function ,f
that extracts the mean (using stringr::str_extract_all()
, and apply f
to rates
, rowwise:
library(dplyr)
library(stringr)
f <- function(s) mean(as.numeric(str_extract_all(s, "[0-9.]*(?=%)")[[1]]), na.rm=T)
df %>% rowwise() %>% mutate(average = f(rates))
Output:
product rates average
<chr> <chr> <dbl>
1 001 7.5% ( 01-01 to 31-01 ) ; 0% ( 01-02 to 30-06 ) ; 7.5% ( 01-07 to 31-12 ) 5
2 002 0% (01 FEB - 30 JUN); 3.70% (01-01 to 31-01) ; 3.70% (01-07 to 31-12) 2.47
CodePudding user response:
In base R
we can use regmatches/regexpr
to extract the numeric part before the %
and get the mean
after converting to numeric
df$average <- sapply(regmatches(df$rates, gregexpr("[0-9.] (?=%)",
df$rates, perl = TRUE)), function(x) mean(as.numeric(x)))
df$average
[1] 5.000000 2.466667
CodePudding user response:
using base R:
vasl <- gsub("%[^;] ", '',df$rates)
df$average <- rowMeans(read.table(text=vals, sep=';',fill= TRUE), na.rm = TRUE)
df$average
[1] 5.000000 2.466667
using tidyverse:
df %>%
mutate(r = str_remove_all(rates,'%[^;] '))%>%
separate_rows(r, convert = TRUE)%>%
group_by(product, rates)%>%
summarise(av = mean(r), .groups = 'drop')
product rates av
<chr> <chr> <dbl>
1 001 7.5% ( 01-01 to 31-01 ) ; 0% ( 01-02 to 30-06 ) ; 7.5% ( 01-07 to 31-12 ) 5
2 002 0% (01 FEB - 30 JUN); 3.70% (01-01 to 31-01) ; 3.70% (01-07 to 31-12) 2.47