Using Regex to Scrape Text

Here is my regex so far. I'm having trouble matching the text before "v." and some other terms.

(?s)v\..*?(?:\d\d\d\d\))

Sample Text:

See Holiday in v. Marriot (2002)

e.g. FB v. Google (2012)

; Yahoo! v. Microsoft (2000)

I need to be able to grab:

Holiday in v. Marriot (2002)

FB v. Google (2012)
 
Yahoo! v. Microsoft (2000)

CodePudding user response：

If only a single capitalized word precedes v., you could start the match at an uppercase char, optionally followed by dots or other uppercase chars, and then match any chars except uppercase ones up to v.

(?s)\b[A-Z][A-Z.]*[^A-Z]*v\..*?\(\d{4}\)
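
As a quick check, here is a minimal Python sketch that runs this pattern over the sample text from the question. Python's re module is an assumption (the question does not name a language), and the variable names are only illustrative.

```python
import re

# Sample text from the question, one citation per line.
text = (
    "See Holiday in v. Marriot (2002)\n"
    "\n"
    "e.g. FB v. Google (2012)\n"
    "\n"
    "; Yahoo! v. Microsoft (2000)"
)

# Start at an uppercase char, allow dots/uppercase chars, then any
# non-uppercase chars up to "v.", then lazily up to a 4-digit year in parens.
pattern = re.compile(r"(?s)\b[A-Z][A-Z.]*[^A-Z]*v\..*?\(\d{4}\)")

matches = pattern.findall(text)
for m in matches:
    print(m)
```

Note that "See" is skipped only because it is followed by another uppercase char before v., so a name with two capitalized words before v. would lose its first word.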

Another option could be specifying the possible leading chars using an alternation | with a capture group:

(?s)(?:\bSee\b|\be\.g\.|;)\s*(.*?\s+v\..*?\(\d{4}\))

(?s) Inline modifier to have the dot match a newline
(?:\bSee\b|\be\.g\.|;) Match one of the alternatives
\s* Match optional whitespace chars
( Capture group 1
- .*?\s+v\. Match as few chars as possible, then 1 or more whitespace chars and v.
- .*?\(\d{4}\) Match as few chars as possible, then 4 digits between parentheses
) Close group 1
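
A minimal Python sketch of the capture-group approach, again assuming Python's re module and writing the whitespace before v. as \s+ (one or more whitespace chars). The names text, pattern and citations are only illustrative.

```python
import re

# Sample text from the question, one citation per line.
text = (
    "See Holiday in v. Marriot (2002)\n"
    "\n"
    "e.g. FB v. Google (2012)\n"
    "\n"
    "; Yahoo! v. Microsoft (2000)"
)

# The non-capturing group consumes the leading term (See, e.g. or ;),
# and group 1 holds the citation itself.
pattern = re.compile(r"(?s)(?:\bSee\b|\be\.g\.|;)\s*(.*?\s+v\..*?\(\d{4}\))")

# With exactly one capture group, findall returns the group 1 values.
citations = pattern.findall(text)
for c in citations:
    print(c)
```

The leading terms have to be listed explicitly, so any other prefix (for example "Cf.") would need to be added to the alternation.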

CodePudding user response：

Use

See\s+(.*)\s+\S+\s+(.*)\s+;\s+(.*)

EXPLANATION

--------------------------------------------------------------------------------
  See                      'See'
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \S+                      non-whitespace (all but \n, \r, \t, \f,
                           and " ") (1 or more times (matching the
                           most amount possible))
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  ;                        ';'
--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
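
A minimal Python sketch of this pattern, assuming Python's re module and each quantified token written with + (1 or more times), as the explanation describes. It relies on the sample layout: exactly three citations, each on its own line, since . does not match a newline here.

```python
import re

# Sample text from the question, one citation per line.
text = (
    "See Holiday in v. Marriot (2002)\n"
    "\n"
    "e.g. FB v. Google (2012)\n"
    "\n"
    "; Yahoo! v. Microsoft (2000)"
)

# No (?s) flag: each greedy (.*) stops at the end of its line,
# \S+ consumes "e.g." and the literal ; consumes the third prefix.
pattern = re.compile(r"See\s+(.*)\s+\S+\s+(.*)\s+;\s+(.*)")

found = pattern.search(text)
groups = found.groups() if found else ()
for g in groups:
    print(g)
```

Because the whole text is matched at once, this only fits input shaped like the sample; a different number of citations or a different order of prefixes would need a different pattern.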