I'm new to R (and Stack Overflow, so sorry about the formatting of the question), and I'm working on an assignment for class.
As part of this, I have a data frame with a chr column holding the number of participants in an event. Some of these are single values, whereas others are ranges.
To proceed, I need to replace each range with its midrange, i.e. the mean of the minimum and maximum values within the cell.
As an example:
df <- data.frame(event=c('Football', 'Basketball', 'Football', 'Tennis', 'Basketball'),
                 num_participants=c('157', '220', '175-189', '190-220', '320-350'))
How would I replace the ranges with the mean of the max/min of the range, while maintaining the singular values (e.g., 157 & 220)?
CodePudding user response:
Split the column into a left-hand and a right-hand part using separate. Since separate works on strings, the new columns are character, so we have to convert them to numeric before calculating the mean.
library(tidyverse)
df <- data.frame(event=c('Football', 'Basketball', 'Football', 'Tennis', 'Basketball'),
num_participants=c("157", "220", "175-189", "190-220", "320-350"))
df %>%
  separate(col = num_participants, sep = "-", into = c("min", "max")) %>%
  # single values have no right-hand part, so reuse min as max
  mutate(max = as.numeric(ifelse(is.na(max), min, max)),
         min = as.numeric(min),
         mean = (min + max)/2)
#
event min max mean
1 Football 157 157 157
2 Basketball 220 220 220
3 Football 175 189 182
4 Tennis 190 220 205
5 Basketball 320 350 335
Note that separate fills in a missing value (NA) when no delimiter is found in an observation, as shown below:
> df %>%
separate(col = num_participants, sep = "-", into = c("min", "max"))
#
event min max
1 Football 157 <NA>
2 Basketball 220 <NA>
3 Football 175 189
4 Tennis 190 220
5 Basketball 320 350
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 2].
That is why I replace the missing values in column max with the corresponding value of min.
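If the goal is to keep a single num_participants column, as the question asks, the same idea could be sketched as below. This is only a minimal sketch, not the code from the answer above: it assumes the tidyr arguments fill = "right" (pad missing pieces with NA on the right, without the warning) and convert = TRUE (convert the new columns to numeric types), uses dplyr's coalesce to fall back to min, and the name result is chosen purely for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  event = c("Football", "Basketball", "Football", "Tennis", "Basketball"),
  num_participants = c("157", "220", "175-189", "190-220", "320-350")
)

result <- df %>%
  # fill = "right": single values get NA in max, with no warning
  # convert = TRUE: min and max come back as numbers, not strings
  separate(num_participants, into = c("min", "max"), sep = "-",
           fill = "right", convert = TRUE) %>%
  # a single value is its own midrange, so use min where max is missing
  mutate(max = coalesce(max, min),
         num_participants = (min + max) / 2) %>%
  select(event, num_participants)

result$num_participants
# 157 220 182 205 335
```

Dropping min and max at the end leaves the data frame with the same two columns it started with, only numeric.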
CodePudding user response:
We may use strsplit to split at the -, convert the list elements to numeric, and get the mean
df$num_participants <- sapply(strsplit(df$num_participants, '-'),
\(x) mean(as.numeric(x)))
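To see why single values pass through unchanged, it may help to look at the intermediate list. The sketch below works on a plain character vector x instead of the data frame column; midrange is a hypothetical helper name, and vapply is swapped in for sapply only to make the return type explicit.

```r
x <- c("157", "220", "175-189", "190-220", "320-350")

# strsplit returns a list: one piece for single values, two for ranges
parts <- strsplit(x, "-", fixed = TRUE)
lengths(parts)
# 1 1 2 2 2

# hypothetical helper: the mean of one number is the number itself,
# the mean of two numbers is their midrange
midrange <- function(p) mean(as.numeric(p))

mids <- vapply(parts, midrange, numeric(1))
mids
# 157 220 182 205 335
```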
Or another option is to read the column with read.table and use rowMeans
df$num_participants <- rowMeans(read.table(text = df$num_participants,
header = FALSE,
sep = '-', fill = TRUE), na.rm = TRUE)
df$num_participants
[1] 157 220 182 205 335
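One caveat with the read.table route, as far as I understand its documentation: without col.names, the number of columns is inferred from the first five lines of input, so a range that first appears further down might not be split as intended. A hedged sketch that names both columns up front is shown below; the column names min and max, and the variables x and parts, are illustrative assumptions.

```r
x <- c("157", "220", "175-189", "190-220", "320-350")

# col.names fixes the table at two columns regardless of the first rows;
# fill = TRUE pads single values with NA in the second column
parts <- read.table(text = x, sep = "-", header = FALSE,
                    fill = TRUE, col.names = c("min", "max"))

# na.rm = TRUE makes the row mean of a single value that value itself
mids <- unname(rowMeans(parts, na.rm = TRUE))
mids
# 157 220 182 205 335
```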
data
df <- structure(list(event = c("Football", "Basketball", "Football",
"Tennis", "Basketball"), num_participants = c("157", "220", "175-189",
"190-220", "320-350")), class = "data.frame", row.names = c(NA,
-5L))