Conditional loop for a dataframe in R

I have a dataframe (tibble) containing information about sentences. The dataframe has the following structure:

word	position	category	related_word	sentence
a	1	det	2	1
man	2	noun	3	1
sees	3	verb	0	1
a	4	det	5	1
horse	5	noun	3	1
and	6	conj	7	1
a	7	det	8	1
dog	8	noun	3	1

I would like to create a loop that looks at every sentence in the dataframe (the sentence number is in the last column), then if there is a noun in the dataframe (category =="noun"), finds its related word by using the value of related_word in the same row as the noun. The value of related_word corresponds to the position of the related word. The loop would then add both words (the noun and its related word) in a new column (in the format "word" "word").

For the dataframe I provided above, there are three nouns in the first sentence. So the loop would first take the first noun ("man") and find its related word by using the value of related_word. Since this value is 3, the related word is the one at position 3, i.e. "sees". Then the loop would write the complete pair, "man sees", in the same row as the word "man", in a new column (called "pair").

For the remaining two nouns ("horse" and "dog"), the new column would hold the following values: "horse sees" and "dog sees".

How could I approach this? There are a few problems here, but the main one is how to use the value of related_word to find the value of a different variable in another row. E.g. how can I get from "man" to "sees"?

CodePudding user response：

You can join the table on itself (join on sentence, and on position equaling related_word). Here is a start - perhaps give us more information about what you want the output to look like?

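library(dplyr)

# Self-join: match each noun's related_word to the position of a word in the same sentence
df %>%
  inner_join(filter(df, category == "noun"), by = c("sentence" = "sentence", "position" = "related_word")) %>%
  mutate(newcol = paste(word.y, word.x)) %>%  # word.y is the noun, word.x is its related word
  select(sentence, newcol)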

Output:

# A tibble: 3 × 2
  sentence newcol    
     <int> <chr>     
1        1 man sees  
2        1 horse sees
3        1 dog sees
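
The same lookup can also be spelled out without a join, which may make it clearer how to get from "man" to "sees". This is only a minimal base R sketch, assuming the data frame is called df and that position is unique within each sentence; own_key, related_key, idx and pairs are names made up for illustration:

```r
# Sample data copied from the question (one sentence, eight words).
df <- data.frame(
  word         = c("a", "man", "sees", "a", "horse", "and", "a", "dog"),
  position     = 1:8,
  category     = c("det", "noun", "verb", "det", "noun", "conj", "det", "noun"),
  related_word = c(2L, 3L, 0L, 5L, 3L, 7L, 8L, 3L),
  sentence     = 1L,
  stringsAsFactors = FALSE
)

# Composite keys keep the lookup inside each sentence.
own_key     <- paste(df$sentence, df$position)
related_key <- paste(df$sentence, df$related_word)

# Row index of each word's related word; NA where there is none
# (e.g. related_word == 0 for "sees").
idx <- match(related_key, own_key)

# Build the pairs for nouns that actually have a related word.
is_noun <- df$category == "noun" & !is.na(idx)
pairs   <- paste(df$word[is_noun], df$word[idx[is_noun]])

print(pairs)
# [1] "man sees"   "horse sees" "dog sees"
```

match() returns, for every row, the row number whose position equals that row's related_word, so df$word[idx] is the related word itself.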

The output can be something slightly different: wrap the join above in a left_join(). Notice that in this iteration I retain position.y in the final select statement, to facilitate the join:

df %>% left_join(
  df %>%
    inner_join(filter(df,category=="noun"), by=c("sentence"="sentence", "position"="related_word")) %>% 
    mutate(newcol = paste(word.y,word.x)) %>% 
    select(sentence, position.y, newcol),
  by=c("sentence"="sentence", "position" = "position.y")
)

Output:

# A tibble: 8 × 6
  word  position category related_word sentence newcol    
  <chr>    <int> <chr>           <int>    <int> <chr>     
1 a            1 det                 2        1 NA        
2 man          2 noun                3        1 man sees  
3 sees         3 verb                0        1 NA        
4 a            4 det                 5        1 NA        
5 horse        5 noun                3        1 horse sees
6 and          6 conj                7        1 NA        
7 a            7 det                 8        1 NA        
8 dog          8 noun                3        1 dog sees
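
If the goal is just the full table with the extra column, a single left_join() against a small lookup table might be enough. This is a sketch rather than a definitive solution, assuming df holds the question's data and position is unique within each sentence; lookup, related and result are illustrative names, and the new column is called pair as in the question:

```r
library(dplyr)

# Sample data copied from the question.
df <- data.frame(
  word         = c("a", "man", "sees", "a", "horse", "and", "a", "dog"),
  position     = 1:8,
  category     = c("det", "noun", "verb", "det", "noun", "conj", "det", "noun"),
  related_word = c(2L, 3L, 0L, 5L, 3L, 7L, 8L, 3L),
  sentence     = 1L,
  stringsAsFactors = FALSE
)

# Lookup table: which word sits at which position in which sentence.
lookup <- df %>%
  select(sentence, position, related = word)

result <- df %>%
  # Attach the word found at position == related_word in the same sentence.
  left_join(lookup, by = c("sentence" = "sentence", "related_word" = "position")) %>%
  # Only nouns with an existing related word get a pair; all other rows stay NA.
  mutate(pair = if_else(category == "noun" & !is.na(related),
                        paste(word, related),
                        NA_character_)) %>%
  select(-related)

print(result)
```

Because every row of df is kept by the left join, no second join is needed to get back to the original eight rows.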