I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>
, <strong>
, etc.
I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.
For example, from the below input, I would expect to extract just a small example link to a webpage
:
<p>just a small <a href="#">
example</a> link</p><p>to a webpage</p>
As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.
So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:
/(<[^>]*>)/
In a sense, I need the negative image of this expression but have not be able to build it myself.
Your help in "negating" the above expression is most appreciated.
CodePudding user response:
Assuming there are no <
or >
elsewhere or entity-encoded, an idea using a lookbehind.
(?:(?<=>)|^)[^<]
(?:(?<=>)|^)
is an alternation between either ^
start of the string or looking behind for any >
. From there [^<]
matches one or more characters that are not <
(negated character class).