Row mean in a sequence across multiple columns-CodePudding

I have monthly data for the state unemployment rates from 2017 to 2019, and 2022. I want to get the yearly average unemployment rate for each state.

Is there a way to use sequences or for loops to get the mean for every 12th column starting from column 2? Then for 2022, it would only be the mean across three columns (January to March)

I currently have the below but this is terribly inefficient. Especially when I start working with much larger data sets.

EDIT

# Edit: Below shows unemployment rates for year 2017 (not showing 2018-2022)

> df[2:13]
   Jan.2017 Feb.2017 Mar.2017 Apr.2017 May.2017 Jun.2017 Jul.2017 Aug.2017 Sep.2017 Oct.2017 Nov.2017 Dec.2017
1       5.5      5.2      5.0      4.8      4.6      4.4      4.3      4.2      4.1      4.0      4.0      4.0
2       6.6      6.6      6.5      6.5      6.5      6.5      6.5      6.5      6.5      6.5      6.5      6.4
3       5.2      5.2      5.1      5.0      5.0      4.9      4.9      4.8      4.9      4.9      4.9      4.9
4       3.8      3.7      3.7      3.7      3.7      3.7      3.7      3.7      3.8      3.8      3.8      3.8

# I am using the below to get the average from Jan2017-Dec2017, Jan 2018-Dec 2018 etc.

df$x2017 <- rowMeans(df[ , c(2:13)], na.rm=TRUE)
df$x2018 <- rowMeans(df[ , c(14:25)], na.rm=TRUE)
df$x2019 <- rowMeans(df[ , c(26:37)], na.rm=TRUE)
df$x2021 <- rowMeans(df[ , c(38:49)], na.rm=TRUE)
df$x2022 <- rowMeans(df[ , c(50:52)], na.rm=TRUE)

# output
  State x2017 x2018 x2019 x2021 x2022
1               Alabama   8.0   7.2   6.6   6.1   5.9
2                Alaska   7.2   7.0   6.6   6.3   6.5
3               Arizona   8.3   7.7   6.7   6.0   5.6
4              Arkansas   7.2   6.9   5.7   4.7   4.0

I’m just looking for something that reduces potential for errors when telling it which columns to get the averages for.

made up df if needed

state = c("Alabama", "Alaska", "Arizona")
Jan2017 = c(1:3)
Feb2017 = c(4:6)
Jan2018 = c(7:9)
Feb2018 = c(10:12)
Jan2019 = c(13:15)
Feb2019 = c(16:18)
df3=data.frame(state,Jan2017,Feb2017,Jan2018,Feb2018,Jan2019,Feb2019)

> df3
    state Jan2017 Feb2017 Jan2018 Feb2018 Jan2019 Feb2019
1 Alabama       1       4       7      10      13      16
2  Alaska       2       5       8      11      14      17
3 Arizona       3       6       9      12      15      18

CodePudding user response：

cbind(df3[1], sapply(split.default(df3[-1], sub("\\D ", "", names(df3)[-1])), rowMeans))

    state 2017 2018 2019
1 Alabama  2.5  8.5 14.5
2  Alaska  3.5  9.5 15.5
3 Arizona  4.5 10.5 16.5

CodePudding user response：

Here is a tidyverse solution with pivoting and summarizing:

library(dplyr)
library(tidyr)

df3 %>% 
  pivot_longer(-state) %>% 
  mutate(helper = parse_number(name)) %>% 
  group_by(state, helper) %>% 
  mutate(mean = mean(value, na.rm=TRUE)) %>% 
  pivot_wider(names_from = helper,
              values_from = mean) %>% 
  group_by(state) %>% 
  summarise(across(-c(name, value), mean, na.rm = TRUE), .groups = 'drop')

       state   `2017` `2018` `2019`
  <chr>    <dbl>  <dbl>  <dbl>
1 Alabama    2.5    8.5   14.5
2 Alaska     3.5    9.5   15.5
3 Arizona    4.5   10.5   16.5

CodePudding user response：

Here is another slightly different tidyverse option:

library(tidyverse)

df3 %>%
  pivot_longer(
    cols = -state,
    names_to = c(NA, ".value"),
    names_sep = "(?<=[a-z])(?=[0-9])"
  ) %>%
  group_by(state) %>% 
  summarize(across(everything(), mean, na.rm = TRUE))

Output

  state   `2017` `2018` `2019`
  <chr>    <dbl>  <dbl>  <dbl>
1 Alabama    2.5    8.5   14.5
2 Alaska     3.5    9.5   15.5
3 Arizona    4.5   10.5   16.5