I have a problem constructing a regular expression

I have a data frame where each row in one column looks like this:

<title>Some text</title>

<selftext>Some text</selftext>

The above is a single row in one column. The problem is that not every row looks like this, and I need to remove the rows that do not.

I tried the code below:

pattern = "<title>[a-zA-Z0-9]</title>\n\n<selftext>[a-zA-Z0-9]</selftext>"
for row in df.column_name:
    if row == pattern:
        print(row)

but no rows are printed, although some should be.

CodePudding user response:

My first guess at what is wrong with the pattern: you define a character class but no quantifier, so it allows exactly one character. Add a + quantifier to allow any content of at least one character within the title and selftext tags.

pattern = "<title>[a-zA-Z0-9]+</title>\n\n<selftext>[a-zA-Z0-9]+</selftext>"

Also, you never applied the pattern as a regex; you only did a string comparison. Unless the row were literally identical to the pattern string, brackets included, the == check would never be true.

Use it like this:

import re
pattern = "<title>[a-zA-Z0-9]+</title>\n\n<selftext>[a-zA-Z0-9]+</selftext>"
for row in df.column_name:
    if re.match(pattern, row):
        print(row)
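
To make the difference between the two checks concrete, here is a small sketch on single values. The sample strings are made up for illustration and are not taken from the real data. It compares a plain == test, a character class without a quantifier, and the same class with +, and it also shows that a row containing a space still fails, because a space is not part of [a-zA-Z0-9].

```python
import re

# Hypothetical sample row without spaces, shaped like the one in the question.
row = "<title>Sometext</title>\n\n<selftext>Sometext</selftext>"

one_char = "<title>[a-zA-Z0-9]</title>\n\n<selftext>[a-zA-Z0-9]</selftext>"
one_or_more = "<title>[a-zA-Z0-9]+</title>\n\n<selftext>[a-zA-Z0-9]+</selftext>"

# Plain string comparison: the pattern text is never interpreted as a regex.
compared_as_string = (row == one_or_more)

# Without a quantifier the class matches exactly one character per tag.
matched_one_char = re.match(one_char, row) is not None

# With + the class matches one or more letters or digits.
matched_one_or_more = re.match(one_or_more, row) is not None

# A space is outside [a-zA-Z0-9], so "Some text" is still rejected.
row_with_space = "<title>Some text</title>\n\n<selftext>Some text</selftext>"
matched_with_space = re.match(one_or_more, row_with_space) is not None

print(compared_as_string, matched_one_char, matched_one_or_more, matched_with_space)
```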

Edit: Unless you really want to restrict the content to exactly that set of letters and digits, I would recommend making the pattern much broader; as written, it rejects the space in "Some text", for instance. Text inside an XML element cannot contain an unescaped <, so you can simply match everything up to the next opening angle bracket. While you're at it, you can also allow empty tags, since these can occur in XML as well.

import re
pattern = "<title>[^<]*</title>\n\n<selftext>[^<]*</selftext>"
for row in df.column_name:
    if re.match(pattern, row):
        print(row)
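
Since the goal in the question is to remove the non-matching rows rather than print the matching ones, the same pattern can be used as a filter. The following is a minimal sketch under a few assumptions: the helper name keep_well_formed and the sample rows are hypothetical, and re.fullmatch is swapped in for re.match so that rows with extra text after the closing selftext tag are also rejected (re.match only anchors at the start of the string).

```python
import re

# Compile once; fullmatch requires the whole row to fit the pattern, not just its start.
PATTERN = re.compile(r"<title>[^<]*</title>\n\n<selftext>[^<]*</selftext>")

def keep_well_formed(rows):
    """Return only the rows that consist of exactly one title and one selftext element."""
    return [row for row in rows if isinstance(row, str) and PATTERN.fullmatch(row)]

# Hypothetical sample rows for illustration.
rows = [
    "<title>Some text</title>\n\n<selftext>Some text</selftext>",
    "<title>Only a title</title>",
    "<title></title>\n\n<selftext></selftext>",
    "<title>Some text</title>\n\n<selftext>Some text</selftext> trailing junk",
    None,  # stands in for a missing value
]
kept = keep_well_formed(rows)
print(len(kept))

# With pandas (assuming version 1.1 or later, where Series.str.fullmatch exists),
# the same test can be applied as a boolean mask on the column from the question:
# df = df[df["column_name"].str.fullmatch(PATTERN.pattern, na=False)]
```

If trailing text after the closing tag should be tolerated, re.match can be used in place of fullmatch.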