My question is very simple but I couldn't figure it out by myself: how to I match superscripted text with regex in Python? I'd like to match patterns like [a-zA-Z0-9,[]] but only if it's superscripted.
regards
CodePudding user response:
The main problem is that information about "superscript" and "subscript" text is not conveyed at the character level.
Unicode even defines some characters to be used as sub and super-script, most notably, all decimal digits have a corresponding character - but just a handful of other latin letters have a full super-script or sub-script character with its own code. Check: https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
So, if you want to match digits only, you could just put the corresponding characters in the regular expressions: "\u207X" (with X varying from 0 to 9) plus "\u00BX" with X in {2, 3, 9} - the table in the linked wikipedia article has all characters.
For the remaining characters, what takes place when we are faced with superscript text is that it is formatting information in a protocol separated from the characters: for example if you are dealing with HTML markup text, text inside the <sup> </sup>
marks.
Just as happen with HTML, in any instance you find superscript text, have to be marked in a text-protocol "outside" of the characters themselves - and therefore, outside what you'd look up in the characters themselves with a regular expression.
If you are dealing with HTML text, you can search your text for the "<sup>
" tag, for example. However, if it is formatted text inside a web page, there are tens of ways of marking the superscript text, as the super-script transformation can be specified in CSS, and the CSS may be applied to the page-text in several different ways.
Other text-protocols exist that might encode super-script text, like "rich-text" (rtf files) . Otherwise you have to say how the text you are dealing with is encoded, and how it does encode the markup for superscript text, in order for a proper regular expression to be built.
If it is plain HTML using "<sup>
" tags, it could be as simple as:
re.findall(r"\<sup.*?\>(.*?)\<\/sup")
. Otherwise, you should inspect your text stream, find out the superscript markup, and use an appropriate regexp, or even another better parsing tool (for HTML/XML you are usually better off using beautifulsoup or other XML tools than regexps, for example).
And, of course, that applies if the information for which text is superscripted is embedded in the text channel, as some kind of markup. It might be on a side-channel: another data block telling at which text indexes the effect of superscript should apply. In this case, you essentially have to figure that out, and then use that information directly.