How to match text and skip HTML tags using a regular expression?

I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>, <strong>, etc.

I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.

For example, from the input below, I would expect to extract "just a small example link to a webpage":

  <p>just a small <a href="#">
  example</a> link</p><p>to a webpage</p>

As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.

So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:

/(<[^>]*>)/

In a sense, I need the negative image of this expression but have not been able to build it myself.

Your help in "negating" the above expression is most appreciated.

User response:

Assuming < and > appear only as tag delimiters (any others being entity-encoded), here is an idea using a lookbehind:

(?:(?<=>)|^)[^<]+

See this demo at regex101

(?:(?<=>)|^) is an alternation between either ^ (start of the string) or a lookbehind for any >. From there, [^<]+ matches one or more characters that are not < (a negated character class).
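To check the pattern outside of Pipelines, here is a minimal Python sketch using the standard re module. The names text_runs, plain_text and sample are made up for illustration, and joining the matched pieces with a single space is an assumption about the desired output, not something the Text channel is known to do.

```python
import re

# The lookbehind pattern, with the "+" quantifier: a run of non-"<"
# characters that begins right after a ">" or at the start of the string.
TEXT_RUN = re.compile(r"(?:(?<=>)|^)[^<]+")


def text_runs(html):
    """Return every run of text found between tags (hypothetical helper)."""
    return TEXT_RUN.findall(html)


def plain_text(html):
    """Join the runs with spaces and collapse whitespace (an assumption:
    this also inserts a space where a tag splits a single word)."""
    return " ".join(" ".join(text_runs(html)).split())


sample = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'

runs = text_runs(sample)
print(runs)                # the individual matches, whitespace preserved
print(plain_text(sample))  # just a small example link to a webpage
```

Each match is a separate piece of text, so the line break after the opening <a> tag stays inside its piece, and nothing separates "link" from "to" unless the pieces are joined with a space as above.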