Home > Software engineering >  Replace 'middle' frequencies with averaged frequency values
Replace 'middle' frequencies with averaged frequency values

Time:01-10

I have this type of data, with frequency data and position data grouped by rowid:

df
   rowid    word   f position
1      2       i 700        1
2      2      'm 600        2
3      2    fine   1        3
4      3     how 400        1
5      3      's 500        2
6      3     the 700        3
7      3 weather  20        4
8      4      it 390        1
9      4      's 500        2
10     4  really 177        3
11     4    very 200        4
12     4    cold  35        5
13     5       i 700        1
14     5    love 199        2
15     5     you 400        3

The task I'm facing seems simple: in those rowids where there are more than 3 positions, I need to replace the frequencies of all middle positions with their average. The following approach works but seems over-convoluted, so I'm almost certain there will be a more straightforward dplyrway to get the desired output:

df %>%
  group_by(rowid) %>%
  # filter for 'middle' positions:
  filter(position != first(position) & position != last(position)) %>%
  # summarise:
  summarize(across(position),
            # create average frequency:
            f_middle_position = mean(f, na.rm = TRUE),
            # concatenate words:
            word = str_c(word, collapse = " ")
            ) %>%
  filter(!duplicated(f_middle_position)) %>%
  # join with df:
  left_join(df, ., by = c("rowid", "position")) %>%
  # remove rows other than #1,#2, and last:
  group_by(rowid) %>%
  # create row count:
  mutate(rn = row_number()) %>%
  # filter first, second, and last row per group:
  filter(rn %in% c(1, 2, last(rn))) %>%
  # transfer frequencies for middle positions:
  mutate(f = ifelse(is.na(f_middle_position), f, f_middle_position)) %>%
  # make more changes:
  mutate(
    # change position labels:
    position = ifelse(position == first(position), 1,
                           ifelse(position == last(position), 2, 1.5)),
    # update word:
    word = ifelse(is.na(word.y), word.x, word.y)
         ) %>%
  # remove obsolete variables:
  select(-c(f_middle_position, word.y, word.x,rn))
 A tibble: 12 × 4
# Groups:   rowid [4]
   rowid     f position word          
   <dbl> <dbl>    <dbl> <chr>         
 1     2  700       1   i             
 2     2  600       1.5 'm            
 3     2    1       2   fine          
 4     3  400       1   how           
 5     3  600       1.5 's the        
 6     3   20       2   weather       
 7     4  390       1   it            
 8     4  292.      1.5 's really very
 9     4   35       2   cold          
10     5  700       1   i             
11     5  199       1.5 love          
12     5  400       2   you 

How can this result be obtained in a more concise way in dplyr and, preferably without the left_join, which causes problems with my actual data?

Data:

df <- data.frame(
  rowid = c(2,2,2,3,3,3,3,4,4,4,4,4,5,5,5),
  word = c("i","'m","fine",
           "how","'s","the","weather",
           "it","'s","really", "very","cold",
           "i","love","you"),
  f = c(700,600,1,
        400,500,700,20,
        390,500,177,200,35,
        700,199,400),
  position = c(1,2,3,
               1,2,3,4,
               1,2,3,4,5,
               1,2,3)
)

CodePudding user response:

You can create a group variable pos that marks the first row with 1, the middle with 1.5, and the last with 2. Then group the data by rowid and pos and apply mean() and paste() on f and word respectively.

library(dplyr)

df %>%
  group_by(rowid) %>% 
  mutate(pos = case_when(position == 1 ~ 1, position == n() ~ 2, TRUE ~ 1.5)) %>%
  group_by(rowid, pos) %>%
  summarise(f = mean(f), word = paste(word, collapse = ' '), .groups = 'drop')

# # A tibble: 12 × 4
#    rowid   pos     f word          
#    <dbl> <dbl> <dbl> <chr>         
#  1     2   1    700  i             
#  2     2   1.5  600  'm            
#  3     2   2      1  fine          
#  4     3   1    400  how           
#  5     3   1.5  600  's the        
#  6     3   2     20  weather       
#  7     4   1    390  it            
#  8     4   1.5  292. 's really very
#  9     4   2     35  cold          
# 10     5   1    700  i             
# 11     5   1.5  199  love          
# 12     5   2    400  you
  • Related