Grouping problem in R to find averages with a condition

Time:10-17

I have a dataframe similar to this:

df <- data.frame(flight_no = c(515,4370,3730,4687,1124), dep_delay = c(-10, 95, -7, 4, 6), is_delayed = c('no', 'yes', 'no', 'yes', 'yes'), distance = c(1065,628,719,569,2565))
 
#>   flight_no dep_delay is_delayed distance
#> 1       515       -10         no     1065
#> 2      4370        95        yes      628
#> 3      3730        -7         no      719
#> 4      4687         4        yes      569
#> 5      1124         6        yes     2565

Considering only the delayed flights, I need to find the average (mean) delay for flights going over 1000 miles and the average (mean) delay for flights going less than 1000 miles.

I have tried this:

library(dplyr)

df %>%
  filter(is_delayed == 'yes') %>%    # Keep delayed flights only
  group_by(distance > 1000) %>%      # Group by whether distance exceeds 1000 miles
  summarise(avg = mean(dep_delay),   # Mean delay per group
            count = n())             # Number of flights per group

Output:
A tibble: 2 × 3
  `distance > 1000`   avg count
  <lgl>             <dbl> <int>
1 FALSE              49.5     2
2 TRUE                6       1

It seems correct. Is there a way to change FALSE and TRUE to 'distance less than 1000' and 'distance more than 1000', respectively? Maybe there is a better way to do this. I'm new to R.

CodePudding user response:

You may conveniently use aggregate for that.

aggregate(dep_delay ~ distance > 1000, df, subset=is_delayed == 'yes', 
          \(x) c(mean=mean(x), n=length(x)))
#   distance > 1000 dep_delay.mean dep_delay.n
# 1           FALSE           49.5         2.0
# 2            TRUE            6.0         1.0
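
The aggregate call above still reports the groups as FALSE/TRUE and stores both statistics in a single matrix column. As a rough sketch of one way to get readable labels and ordinary columns in base R: the column name distance_group is made up for illustration, and the labels are copied from the other answer in this thread.

```r
df <- data.frame(
  flight_no  = c(515, 4370, 3730, 4687, 1124),
  dep_delay  = c(-10, 95, -7, 4, 6),
  is_delayed = c("no", "yes", "no", "yes", "yes"),
  distance   = c(1065, 628, 719, 569, 2565)
)

# Hypothetical helper column: readable labels instead of TRUE/FALSE.
df$distance_group <- ifelse(df$distance > 1000,
                            "distance more than 1000",
                            "distance less or equal to 1000")

# Same aggregation as above, grouped by the labelled column.
res <- aggregate(dep_delay ~ distance_group, data = df,
                 FUN = function(x) c(mean = mean(x), n = length(x)),
                 subset = is_delayed == "yes")

# aggregate() keeps mean and n in one matrix column; this flattens it
# into two regular columns, dep_delay.mean and dep_delay.n.
res <- do.call(data.frame, res)
print(res)
```

Using function(x) instead of \(x) is only a compatibility choice, since the backslash shorthand needs R 4.1 or later.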

CodePudding user response:

You can use ifelse to replace TRUE and FALSE with descriptive labels, and round to round the values.

df %>% 
  filter(is_delayed == "yes") %>% 
  group_by(distance_1000 = ifelse(distance > 1000, "distance more than 1000", "distance less or equal to 1000")) %>% 
  summarise(avg = round(mean(dep_delay), 2),
            count = n())

#                    distance_1000  avg count
# 1 distance less or equal to 1000 49.5     2
# 2        distance more than 1000  6.0     1
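
If more than two distance bands are ever needed, cut might be a tidier alternative to nested ifelse calls. The following is only a sketch under that assumption: the column name distance_group is hypothetical, the labels are reused from the answer above, and a flight of exactly 1000 miles lands in the lower group, matching the distance > 1000 test.

```r
library(dplyr)

df <- data.frame(
  flight_no  = c(515, 4370, 3730, 4687, 1124),
  dep_delay  = c(-10, 95, -7, 4, 6),
  is_delayed = c("no", "yes", "no", "yes", "yes"),
  distance   = c(1065, 628, 719, 569, 2565)
)

res <- df %>%
  filter(is_delayed == "yes") %>%
  # cut() bins distance into (-Inf, 1000] and (1000, Inf] and labels each bin.
  group_by(distance_group = cut(distance,
                                breaks = c(-Inf, 1000, Inf),
                                labels = c("distance less or equal to 1000",
                                           "distance more than 1000"))) %>%
  summarise(avg = mean(dep_delay),
            count = n())

print(res)
```

Because cut returns a factor, the groups keep the order given in labels rather than being sorted alphabetically, and adding another band only requires one more break and one more label.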