Using Regex to Scrape Text

Time:06-14

Here is my code so far. I'm having trouble matching the text before "v." and some other terms.

(?s)v\..*?(?:\d\d\d\d\))

Sample Text:

See Holiday in v. Marriot (2002)

e.g. FB v. Google (2012)

; Yahoo! v. Microsoft (2000)

I need to be able to grab:

Holiday in v. Marriot (2002)

FB v. Google (2012)
 
Yahoo! v. Microsoft (2000)

CodePudding user response:

If only a single word before v. starts with an uppercase char, you could start the match at an uppercase char, optionally followed by dots or other uppercase chars, and then match any chars except uppercase ones up to v.

(?s)\b[A-Z][A-Z.]*[^A-Z]*v\..*?\(\d{4}\)

Regex demo
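
To check this pattern against the sample text from the question, here is a minimal Python sketch. The variable names are just for illustration, and it assumes the sample is held in one multi-line string.

```python
import re

# Sample text from the question, kept as one multi-line string.
text = (
    "See Holiday in v. Marriot (2002)\n\n"
    "e.g. FB v. Google (2012)\n\n"
    "; Yahoo! v. Microsoft (2000)"
)

# Start at an uppercase char, allow dots/uppercase chars, then any
# non-uppercase chars up to "v.", then lazily up to "(dddd)".
pattern = re.compile(r"(?s)\b[A-Z][A-Z.]*[^A-Z]*v\..*?\(\d{4}\)")

matches = pattern.findall(text)
for m in matches:
    print(m)
# Holiday in v. Marriot (2002)
# FB v. Google (2012)
# Yahoo! v. Microsoft (2000)
```

Note that a name with two capitalized words before v. (for example "Holiday Inn v. Marriot (2002)") would only be matched from the last capitalized word, since [^A-Z]* cannot cross an uppercase char.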

Another option could be specifying the possible leading chars using an alternation | with a capture group:

(?s)(?:\bSee\b|\be\.g\.|;)\s*(.*?\s+v\..*?\(\d{4}\))
  • (?s) Inline modifier to have the dot match a newline
  • (?:\bSee\b|\be\.g\.|;) Match one of the alternatives
  • \s* Match optional whitespace chars
  • ( Capture group 1
    • .*?\s+v\. Match as few chars as possible, then 1+ whitespace chars and v.
    • .*?\(\d{4}\) Match as few chars as possible, then 4 digits between parentheses
  • ) Close group 1

Regex demo
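
A short Python sketch of the capture-group variant, assuming the quantifier before v. is \s+ (one or more whitespace chars). With a single capture group, re.findall returns the group 1 values directly; the variable names are illustrative only.

```python
import re

# Sample text from the question, kept as one multi-line string.
text = (
    "See Holiday in v. Marriot (2002)\n\n"
    "e.g. FB v. Google (2012)\n\n"
    "; Yahoo! v. Microsoft (2000)"
)

# The leading marker (See, e.g. or ;) is matched but not captured;
# group 1 holds the case name plus the year in parentheses.
pattern = re.compile(r"(?s)(?:\bSee\b|\be\.g\.|;)\s*(.*?\s+v\..*?\(\d{4}\))")

cases = pattern.findall(text)  # one group, so this is a list of group 1 values
print(cases)
# ['Holiday in v. Marriot (2002)', 'FB v. Google (2012)', 'Yahoo! v. Microsoft (2000)']
```

Other leading terms would have to be added to the alternation by hand.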

CodePudding user response:

Use

See\s+(.*)\s+\S+\s+(.*)\s+;\s+(.*)

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  See                      'See'
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \S+                      non-whitespace (all but \n, \r, \t, \f,
                           and " ") (1 or more times (matching the
                           most amount possible))
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  ;                        ';'
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
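
As a rough check of the explanation above, here is a minimal Python sketch that applies the pattern with "1 or more" quantifiers (\s+ and \S+) to the sample text. It assumes the whole sample is one string with exactly this three-entry layout; the variable names are illustrative.

```python
import re

# Sample text from the question, kept as one multi-line string.
text = (
    "See Holiday in v. Marriot (2002)\n\n"
    "e.g. FB v. Google (2012)\n\n"
    "; Yahoo! v. Microsoft (2000)"
)

# No (?s) here: each (.*) stops at the end of its line, while \s+
# consumes the blank lines between the entries.
pattern = re.compile(r"See\s+(.*)\s+\S+\s+(.*)\s+;\s+(.*)")

m = pattern.search(text)
if m:
    for group in m.groups():
        print(group)
# Holiday in v. Marriot (2002)
# FB v. Google (2012)
# Yahoo! v. Microsoft (2000)
```

This produces a single match with three groups, so it only fits input that follows the same See / one-word marker / ; sequence as the sample.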