I have a dataframe (tibble) containing information about sentences. The dataframe has the following structure:
word | position | category | related_word | sentence |
---|---|---|---|---|
a | 1 | det | 2 | 1 |
man | 2 | noun | 3 | 1 |
sees | 3 | verb | 0 | 1 |
a | 4 | det | 5 | 1 |
horse | 5 | noun | 3 | 1 |
and | 6 | conj | 7 | 1 |
a | 7 | det | 8 | 1 |
dog | 8 | noun | 3 | 1 |
I would like to create a loop that looks at every sentence in the dataframe (the sentence number is in the last column), then if there is a noun in the dataframe (category =="noun"), finds its related word by using the value of related_word in the same row as the noun. The value of related_word corresponds to the position of the related word. The loop would then add both words (the noun and its related word) in a new column (in the format "word" "word").
For the dataframe I provided above, there are three nouns in the first sentence. So the loop would first take the first noun ("man") and find its related word by using the value of related_word. Since this value is 3, the related word is the one at position 3, i.e. "sees". Then the loop would write the complete pair, "man sees", in the same row as the word "man", in a new column (called "pair").
For the remaining two nouns ("horse" and "dog"), the new column would hold the following values: "horse sees" and "dog sees".
How could I approach this? There are a few problems here, but the main one is how to use the value of related_word to look up the value of a different variable in another row. E.g. how can I get from "man" to "sees"?
CodePudding user response:
You can join the table to itself (join on sentence, and on position equaling related_word). Here is a start. Perhaps give us more information about what you want the output to look like?
library(dplyr)

# x side (.x) is the related word, y side (.y) is the noun pointing at it
df %>%
  inner_join(filter(df, category == "noun"), by = c("sentence" = "sentence", "position" = "related_word")) %>%
  mutate(newcol = paste(word.y, word.x)) %>%
  select(sentence, newcol)
Output:
# A tibble: 3 × 2
sentence newcol
<int> <chr>
1 1 man sees
2 1 horse sees
3 1 dog sees
If you want the pairs attached to the original rows instead, wrap the above in a left_join(). Notice that in this iteration I retain position.y in the final select statement, to facilitate the join:
df %>%
  left_join(
    df %>%
      inner_join(filter(df, category == "noun"), by = c("sentence" = "sentence", "position" = "related_word")) %>%
      mutate(newcol = paste(word.y, word.x)) %>%
      select(sentence, position.y, newcol),
    by = c("sentence" = "sentence", "position" = "position.y")
  )
Output:
# A tibble: 8 × 6
word position category related_word sentence newcol
<chr> <int> <chr> <int> <int> <chr>
1 a 1 det 2 1 NA
2 man 2 noun 3 1 man sees
3 sees 3 verb 0 1 NA
4 a 4 det 5 1 NA
5 horse 5 noun 3 1 horse sees
6 and 6 conj 7 1 NA
7 a 7 det 8 1 NA
8 dog 8 noun 3 1 dog sees
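As an alternative to the self-join (not what the code above does), the lookup "from man to sees" can also be sketched with match() inside each sentence: match(related_word, position) gives, for every row, the index of the row whose position equals that row's related_word. The sketch below rebuilds the example data by hand and assumes position values are unique within a sentence; the helper column name related is only illustrative, and the result column is named pair as in the question.

```r
library(dplyr)

# Rebuild the example data from the question (one sentence, eight words).
df <- tibble::tibble(
  word         = c("a", "man", "sees", "a", "horse", "and", "a", "dog"),
  position     = 1:8,
  category     = c("det", "noun", "verb", "det", "noun", "conj", "det", "noun"),
  related_word = c(2L, 3L, 0L, 5L, 3L, 7L, 8L, 3L),
  sentence     = rep(1L, 8)
)

pairs <- df %>%
  group_by(sentence) %>%  # look words up only within the same sentence
  mutate(
    # Index of the row whose position equals related_word; NA when there is
    # no such row (e.g. related_word == 0).
    related = word[match(related_word, position)],
    # Only nouns get a pair; every other row stays NA.
    pair = if_else(category == "noun", paste(word, related), NA_character_)
  ) %>%
  ungroup() %>%
  select(-related)

pairs
```

This keeps all eight rows and fills pair only for the noun rows, so it should line up with the left_join() output above. The join version may be the better fit if you later need other columns of the related word as well.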