I'm trying to summarise f
requency data that occur in 'middle' position
s, i.e., between the first and the last position
. My approach to this task is to filter for these data, do the summarise
, and then rejoin the new data with the dataframe from which they were filtered. This works well with the training data:
library(tidyverse)
df %>%
group_by(rowid) %>%
# summarize frequencies for middle postions:
filter(position != first(position) & position != last(position)) %>%
# summarise:
summarize(across(position),
middle_position = mean(f, na.rm = TRUE),
word = str_c(word, collapse=" ")
) %>%
left_join(df, ., by = c("rowid", "position"))
However, applied to my actual data, I get this error message:
Error in `left_join()`:
! Join columns must be present in data.
✖ Problem with `position`.
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/rlang_error>
Error in `left_join()`:
! Join columns must be present in data.
✖ Problem with `position`.
---
Backtrace:
1. ... %>% left_join(bnc_X, ., by = c("rowid", "position"))
3. dplyr:::left_join.data.frame(bnc_X, ., by = c("rowid", "position"))
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/rlang_error>
Error in `left_join()`:
! Join columns must be present in data.
✖ Problem with `position`.
---
Backtrace:
▆
1. ├─... %>% left_join(bnc_X, ., by = c("rowid", "position"))
2. ├─dplyr::left_join(bnc_X, ., by = c("rowid", "position"))
3. └─dplyr:::left_join.data.frame(bnc_X, ., by = c("rowid", "position"))
4. └─dplyr:::join_mutate(...)
5. └─dplyr:::join_cols(...)
6. └─dplyr:::standardise_join_by(...)
7. └─dplyr:::check_join_vars(by$x, x_names, error_call = error_call)
8. └─rlang::abort(bullets, call = error_call)
The main problem seems to be the variable position
- why is it not recognized? I've spent a good few hours trying to solve the issue but couldn't, and would be grateful for help!
Data:
df <- data.frame(
size = c(3,3,3,
3,3,3,
4,4,4,4,
5,5,5,5,5,
3,3,3),
rowid = c(1,1,1,2,2,2,3,3,3,3,4,4,4,4,4,5,5,5),
turn = c(rep("How are you?",3),
rep("I'm fine.",3),
rep("How's the weather?",4),
rep("It's really very cold.",5),
rep("I love you",3)),
word = c("how","are","you",
"i","'m","fine",
"how","'s","the","weather",
"it","'s","really", "very","cold",
"i","love","you"),
f = c(400,300,250,
600,555,1,
400,500,700,20,
390,500,177,200,35,
600,199,400),
position = c(1,2,3,
1,2,3,
1,2,3,4,
1,2,3,4,5,
1,2,3)
)
CodePudding user response:
This works for me in data.table
. No joins needed.
library(data.table)
# set to data.table
setDT(df)
# get id's of middle rows by rowid
idx = df[, .(idx = .I[-c(1L, .N)]), by = .(rowid)]$idx
# update these middle rows
df[idx, `:=`(middle_position = mean(f),
word_midddel = paste0(word, collapse = " ")),
by = .(rowid)]
size rowid turn word f position middle_position word_midddel
1: 3 1 How are you? how 400 1 NA <NA>
2: 3 1 How are you? are 300 2 300.0000 are
3: 3 1 How are you? you 250 3 NA <NA>
4: 3 2 I'm fine. i 600 1 NA <NA>
5: 3 2 I'm fine. 'm 555 2 555.0000 'm
6: 3 2 I'm fine. fine 1 3 NA <NA>
7: 4 3 How's the weather? how 400 1 NA <NA>
8: 4 3 How's the weather? 's 500 2 600.0000 's the
9: 4 3 How's the weather? the 700 3 600.0000 's the
10: 4 3 How's the weather? weather 20 4 NA <NA>
11: 5 4 It's really very cold. it 390 1 NA <NA>
12: 5 4 It's really very cold. 's 500 2 292.3333 's really very
13: 5 4 It's really very cold. really 177 3 292.3333 's really very
14: 5 4 It's really very cold. very 200 4 292.3333 's really very
15: 5 4 It's really very cold. cold 35 5 NA <NA>
16: 3 5 I love you i 600 1 NA <NA>
17: 3 5 I love you love 199 2 199.0000 love
18: 3 5 I love you you 400 3 NA <NA>