get id for pattern matches

Time:10-04

I want to extract collocates of the lemma GO.

df <- data.frame(
  id = 1:6,
  go = c("go after it", "here we go", "he went bust", "go get it go", 
         "i 'm gon na go", "she 's going berserk"))

I can extract the collocates like this:

# lemma forms:
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na") 

# alternation pattern:
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")

# extraction:
library(stringr)
df_GO <- data.frame(
  left = unlist(str_extract_all(df$go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))),
  node = unlist(str_extract_all(df$go, pattern_GO)),
  right = unlist(str_extract_all(df$go, paste0("(?<=\\s?", pattern_GO, "\\s?)('?\\b[a-z']+\\b|$)")))
)

The result is fine, but it does not include the id value, i.e., I can't tell which 'sentence' each match was extracted from:

df_GO
  left   node   right
1          go   after
2   we     go        
3   he   went    bust
4          go     get
5   it     go        
6   'm gon na      go
7   na     go        
8   's  going berserk

How can the id value be fetched so that the outcome is this:

df_GO
  left   node   right    id
1          go   after     1
2   we     go             2   
3   he   went    bust     3
4          go     get     4
5   it     go             4   
6   'm gon na      go     5
7   na     go             5  
8   's  going berserk     6

CodePudding user response:

You are almost there. You need to iterate over your data frame and perform the extraction on each row separately. That way the id of the row is still known when its matches are extracted, and it can be stored alongside them.

To do this, we wrap your steps in a function and add the id to the data frame it builds.

The following uses the tidyverse packages, in particular {purrr} for the iteration.

library(tidyverse)

# wrap your call into a function that we perform on each row
extract_GO <- function(df_row){
    # Your extraction code, with df$go replaced by df_row$go so that it is
    # clear we operate on the single row passed to the function.
    data.frame(
        id = df_row$id,    # we also store the id of the row we process
        left = unlist(str_extract_all(df_row$go, paste0("('?\\b[a-z']+\\b|^)(?=\\s?", pattern_GO, ")"))),
        node = unlist(str_extract_all(df_row$go, pattern_GO)),
        right = unlist(str_extract_all(df_row$go, paste0("(?<=\\s?", pattern_GO, "\\s?)('?\\b[a-z']+\\b|$)")))
    )
}

# --------------- next we iterate with purrr
## try df %>% group_split(id) to see what group_split() does

df %>% 
   group_split(id) %>%    # splits data frame into list of bins, i.e. by id
   purrr::map_dfr(~ extract_GO(.x))  # now we iterate over bins with our function

This yields:

  id left   node   right
1  1          go   after
2  2   we     go        
3  3   he   went    bust
4  4          go     get
5  4   it     go        
6  5   'm gon na      go
7  5   na     go        
8  6   's  going berserk
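
As a possible alternative without explicit iteration, here is a minimal sketch that relies on str_extract_all() returning a list with one element per input row. Repeating each id by the number of matches in its row lines the ids up with the unlisted matches. The names nodes and df_nodes are only illustrative, and the sketch covers just the node column; left and right could presumably be handled the same way, assuming each of those extractions returns the same number of matches per row as the node pattern.

```r
library(stringr)

df <- data.frame(
  id = 1:6,
  go = c("go after it", "here we go", "he went bust", "go get it go",
         "i 'm gon na go", "she 's going berserk"))

lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")

# one list element per row of df, holding all matches found in that row
nodes <- str_extract_all(df$go, pattern_GO)

# repeat each id once per match in its row, then flatten the matches
df_nodes <- data.frame(
  id   = rep(df$id, lengths(nodes)),
  node = unlist(nodes)
)

df_nodes
```

Rows with no match at all contribute zero repetitions of their id, so they simply drop out of the result.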