Calculate mean based on range within a cell

Time:04-12

I'm new to R (and Stack Overflow, so sorry about the formatting of the question), and I'm working on an assignment for class.

As part of this, I have a data frame with a chr column of the number of participants in an event. Some of these are whole values, whereas some are ranges.

To proceed, I need to replace the ranges with the midrange based on the values within the cell.

As an example:

df <- data.frame(event=c('Football', 'Basketball', 'Football', 'Tennis', 'Basketball'),
                 num_participants=c('157', '220', '175-189', '190-220', '320-350'))

How would I replace the ranges with the mean of the max/min of the range, while maintaining the singular values (e.g., 157 & 220)?

CodePudding user response:

Split the column into a left-hand and a right-hand part using separate. Because separate works on strings, the resulting columns then have to be converted to numeric before calculating the mean.

library(tidyverse)

df <- data.frame(event=c('Football', 'Basketball', 'Football', 'Tennis', 'Basketball'),
                 num_participants=c("157", "220", "175-189", "190-220", "320-350")) 

df %>% 
  separate(col = num_participants, sep = "-", into = c("min", "max")) %>%
  # single values have no right-hand part, so reuse min as max
  mutate(max = as.numeric(ifelse(is.na(max), min, max)),
         min = as.numeric(min),
         mean = (min + max)/2) 
#
       event min max mean
1   Football 157 157  157
2 Basketball 220 220  220
3   Football 175 189  182
4     Tennis 190 220  205
5 Basketball 320 350  335

Note that separate fills in a missing value (NA) when no delimiter is found in an observation, as shown below:

> df %>% 
    separate(col = num_participants, sep = "-", into = c("min", "max"))

#
       event min  max
1   Football 157 <NA>
2 Basketball 220 <NA>
3   Football 175  189
4     Tennis 190  220
5 Basketball 320  350

Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 2]. 

That is why I replace the missing values in the max column with the corresponding min value.
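As a variation on the approach above, the warning can probably be avoided by telling separate how to fill the missing piece. This is only a sketch: the column names low, high and midrange are made up for illustration (they also avoid shadowing the base functions min and max), and it assumes every cell is either a single whole number or two whole numbers joined by a hyphen.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  event = c("Football", "Basketball", "Football", "Tennis", "Basketball"),
  num_participants = c("157", "220", "175-189", "190-220", "320-350")
)

result <- df %>%
  # fill = "right" pads the missing right-hand piece with NA without warning;
  # convert = TRUE turns the new character columns into numbers
  separate(num_participants, into = c("low", "high"), sep = "-",
           fill = "right", convert = TRUE) %>%
  # for single values, reuse low as high so the midrange equals the value itself
  mutate(high = coalesce(high, low),
         midrange = (low + high) / 2)

result$midrange
```

Here coalesce takes the first non-missing value, which does the same job as the ifelse(is.na(max), min, max) step.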

CodePudding user response:

We may use strsplit to split at the -, convert the list elements to numeric, and get the mean:

df$num_participants <- sapply(strsplit(df$num_participants, '-'),
     \(x) mean(as.numeric(x)))
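If it helps, the same idea can be wrapped in a small helper. This is a minimal sketch rather than part of the original answer: the function name midrange is hypothetical, and it assumes the values are plain numbers separated by a literal hyphen. vapply is used instead of sapply so that the result is guaranteed to be a numeric vector.

```r
# Hypothetical helper: midpoint of "a-b" ranges; single values pass through
midrange <- function(x) {
  # fixed = TRUE treats "-" as a literal character rather than a regex
  parts <- strsplit(as.character(x), "-", fixed = TRUE)
  # the mean of one number is the number itself; the mean of two is the midpoint
  vapply(parts, function(p) mean(as.numeric(p)), numeric(1))
}

df <- data.frame(
  event = c("Football", "Basketball", "Football", "Tennis", "Basketball"),
  num_participants = c("157", "220", "175-189", "190-220", "320-350")
)

df$num_participants <- midrange(df$num_participants)
df$num_participants
```

Cells that contain anything other than digits and a single hyphen would turn into NA with a coercion warning, so it may be worth checking the column first.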

Or another option is to read the column with read.table and use rowMeans

df$num_participants <- rowMeans(read.table(text = df$num_participants,
     header = FALSE,
     sep = '-', fill = TRUE), na.rm = TRUE)
df$num_participants
[1] 157 220 182 205 335

data

df <- structure(list(event = c("Football", "Basketball", "Football", 
"Tennis", "Basketball"), num_participants = c("157", "220", "175-189", 
"190-220", "320-350")), class = "data.frame", row.names = c(NA, 
-5L))
Tags: r