Regex pattern to extract Hearst patterns

Time:10-14

I am new to regex and I am unable to extract hyponym-hypernym pairs in the form of a list or tuple. I tried using this pattern, but I get no matches:

(NP_[\w.]*(, NP_[\w.]*)*,? (and)? other NP_[\w.]*)

I have the following annotated sentences for the 'and other' pattern:

  1. NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , and other NP_possible_charges .
  2. The NP_network has asked NP_big_bang_theory_co-creator_bill prady to mastermind the NP_revival , which would see the NP_return of NP_kermit the NP_frog , NP_miss_piggy , NP_fozzie_bear and other NP_old_favorites .

I want to extract a list such as:

[NP_dui,NP_fleeing or NP_evading_police, NP_possible_charges]

OR

(NP_dui,NP_possible_charges)
(NP_fleeing or NP_evading_police,NP_possible_charges)

Similarly, for sentence 2:

[NP_kermit the NP_frog , NP_miss_piggy , NP_fozzie_bear, NP_old_favorites]

or similar tuples.

Any help would be appreciated.

CodePudding user response:

Use

NP_[\w.]*(?:\s*(?:,|\bor\b|,?\s*and(?:\s+other)?\b)\s*NP_[\w.]*)+

This matches each whole chain of noun phrases joined by commas, 'or', 'and' or 'and other'. Next, pull the individual entities out of each matched chain with NP_[\w.]*.

Python code:

import re

test_strs = ["NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , and other NP_possible_charges.",
"The NP_network has asked NP_big_bang_theory_co-creator_bill prady to mastermind the NP_revival , which would see the NP_return of NP_kermit the NP_frog , NP_miss_piggy , NP_fozzie_bear and other NP_old_favorites ."]
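# The group must repeat (+), and 'other' is preceded by one or more whitespace characters (\s+)
p = r'NP_[\w.]*(?:\s*(?:,|\bor\b|,?\s*and(?:\s+other)?\b)\s*NP_[\w.]*)+'

for test_str in test_strs:
    matches = []
    # Each match is a whole chain such as "NP_a , NP_b and other NP_c"
    for match in re.findall(p, test_str):
        # Split the chain into its individual NP_ tokens
        matches.extend(re.findall(r'NP_[\w.]*\b', match))
    print(matches)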

Results:

['NP_dui', 'NP_fleeing', 'NP_evading_police', 'NP_possible_charges']
['NP_frog', 'NP_miss_piggy', 'NP_fozzie_bear', 'NP_old_favorites']
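Since the question asks for hyponym-hypernym tuples rather than a flat list, here is a minimal sketch of one way to build them. It assumes the chain pattern is meant to repeat its group (`+`) and to allow whitespace before `other` (`\s+`), as the explanation describes. The function name `hearst_pairs` and the rule "everything before 'and other' is a hyponym, the first NP after it is the hypernym" are illustrative assumptions, not part of the original answer.

```python
import re

# Assumed chain pattern: an NP followed by one or more ", NP" / "or NP" / "and (other) NP" parts.
CHAIN = re.compile(r'NP_[\w.]*(?:\s*(?:,|\bor\b|,?\s*and(?:\s+other)?\b)\s*NP_[\w.]*)+')
NP = re.compile(r'NP_[\w.]*\b')
# Separator that marks the hypernym in the "X, Y and other Z" Hearst pattern.
AND_OTHER = re.compile(r',?\s*\band\s+other\b\s*')

def hearst_pairs(sentence):
    """Return (hyponym, hypernym) tuples for each 'and other' chain (hypothetical helper)."""
    pairs = []
    for chain in CHAIN.findall(sentence):
        parts = AND_OTHER.split(chain)
        if len(parts) != 2:
            continue  # no "and other", or more than one: skipped in this sketch
        hyponyms = NP.findall(parts[0])
        hypernyms = NP.findall(parts[1])
        if hypernyms:
            pairs.extend((hypo, hypernyms[0]) for hypo in hyponyms)
    return pairs

s1 = "NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , and other NP_possible_charges ."
for pair in hearst_pairs(s1):
    print(pair)
```

Note that this treats NP_fleeing and NP_evading_police as separate hyponyms, and that in the second sentence NP_kermit is not picked up because 'the' breaks the chain before NP_frog.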

EXPLANATION

--------------------------------------------------------------------------------
  NP_                      'NP_'
--------------------------------------------------------------------------------
  [\w.]*                   any character of: word characters (a-z, A-
                           Z, 0-9, _), '.' (0 or more times (matching
                           the most amount possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      ,                        ','
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w)
                               and something that is not a word char
--------------------------------------------------------------------------------
      or                       'or'
--------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w)
                               and something that is not a word char
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      ,?                       ',' (optional (matching the most
                               amount possible))
--------------------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ")
                               (0 or more times (matching the most
                               amount possible))
--------------------------------------------------------------------------------
      and                      'and'
--------------------------------------------------------------------------------
      (?:                      group, but do not capture (optional
                               (matching the most amount possible)):
--------------------------------------------------------------------------------
        \s+                      whitespace (\n, \r, \t, \f, and " ")
                                 (1 or more times (matching the most
                                 amount possible))
--------------------------------------------------------------------------------
        other                    'other'
--------------------------------------------------------------------------------
      )?                       end of grouping
--------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w)
                               and something that is not a word char
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    NP_                      'NP_'
--------------------------------------------------------------------------------
    [\w.]*                   any character of: word characters (a-z,
                             A-Z, 0-9, _), '.' (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )+                       end of grouping
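The breakdown above can also be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. This is only a sketch of that idea; the name `PATTERN` and the sample sentence variable are illustrative, and the pattern assumes the repeated group (`+`) and `\s+` before 'other' described in the explanation.

```python
import re

PATTERN = re.compile(r"""
    NP_[\w.]*                 # first noun phrase: 'NP_' plus word characters or '.'
    (?:                       # group repeated one or more times (see the + below)
        \s*                   # optional whitespace
        (?:
            ,                 # a comma
          | \bor\b            # or the word 'or'
          | ,?\s*and          # or an optional comma followed by 'and'
            (?:\s+other)?\b   #    optionally followed by 'other'
        )
        \s*                   # optional whitespace
        NP_[\w.]*             # next noun phrase
    )+
""", re.VERBOSE)

sentence = "NP_kimmel faces NP_dui , NP_fleeing or NP_evading_police , and other NP_possible_charges ."
chains = PATTERN.findall(sentence)
print(chains)
```

Because the pattern uses `\s` rather than literal spaces, it behaves the same in verbose mode as in the one-line form.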