How to match text and skip HTML tags using a regular expression?

Time:01-23

I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>, <strong>, etc.

I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.

For example, from the input below, I would expect to extract "just a small example link to a webpage":

  <p>just a small <a href="#">
  example</a> link</p><p>to a webpage</p> 

As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.

So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:

/(<[^>]*>)/
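For reference, here is a minimal Python sketch of what that pattern does on the sample input. It assumes the standard `re` module behaves like the Pipelines backend, which I have not verified; the names `TAG_PATTERN` and `sample` are only illustrative.

```python
import re

# The pattern from above, without the surrounding / delimiters.
TAG_PATTERN = re.compile(r"(<[^>]*>)")

# Sample input from the question (the line break inside the text is kept).
sample = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'

# findall returns the contents of the single capture group for each match,
# i.e. every tag and none of the text between the tags.
tags = TAG_PATTERN.findall(sample)
print(tags)
```

So the output is the list of tags only, which is the opposite of what I want.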

In a sense, I need the negative image of this expression but have not been able to build it myself.

Your help in "negating" the above expression is most appreciated.

CodePudding user response:

Assuming no < or > appears outside the tags (either there are none, or they are entity-encoded), one idea is to use a lookbehind:

(?:(?<=>)|^)[^<]+

See this demo at regex101

(?:(?<=>)|^) is an alternation between either ^ (start of the string) or a lookbehind for >. From there, [^<]+ matches one or more characters that are not < (a negated character class).
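A quick way to check the pattern outside of Pipelines is a small Python sketch like the one below. It uses the standard `re` module on the sample input from the question; the names `TEXT_PATTERN`, `pieces`, and `plain` are illustrative, and the whitespace cleanup at the end is an extra step that is not part of the regex itself.

```python
import re

# Text runs that start either at the beginning of the string or right after a '>'.
TEXT_PATTERN = re.compile(r"(?:(?<=>)|^)[^<]+")

# Sample input from the question (the line break inside the text is kept).
sample = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'

# Each match is one run of text between tags; the tags themselves never match
# because a match can only begin at the start of the string or after a '>'.
pieces = TEXT_PATTERN.findall(sample)
print(pieces)

# Assumption: joining with a space and collapsing whitespace is acceptable here.
# It would split a word that is interrupted by an inline tag, e.g. ex<b>am</b>ple.
plain = " ".join(" ".join(pieces).split())
print(plain)
```

The matches come back as separate pieces, so whether they end up as one clean string depends on how the tool concatenates them; the last two lines show one possible cleanup.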
