Problem of regex to separate country names with numbers that follow them

Time:01-16

I have to clean the "Country" column of a DataFrame in which the country names are sometimes followed by numbers (for example, "France6" instead of "France"). I would like to separate the country name from the number that follows it.

I coded this function to solve the problem:

def new_name2(row):
    for item in re.finditer("([a-zA-Z]*)(\d*)",row.Country):
        row.Country=item.group(1)
    return row

We can see that I created two groups, the first one to catch the country name, and the other to separate the number. Following that, I should get (France)(6).

Unfortunately, when I run it, my Country column ends up empty. This means that the first group I get is not "France" but "", and I don't understand why, because on a regex testing website my expression ([a-zA-Z]*)(\d*) appears to work.

CodePudding user response:

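Your loop overwrites row.Country on every match, including zero-length ones! Because both groups use *, the pattern can match the empty string, so after matching "France6" re.finditer yields one more, empty match at the end of the string, and that last iteration sets row.Country to "".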
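To make the zero-length match visible, here is a minimal sketch that lists every match the original pattern produces for a single value. The sample string "France6" comes from the question; the variable names are only illustrative.

```python
import re

# Same pattern as in the question, written as a raw string.
pattern = re.compile(r"([a-zA-Z]*)(\d*)")

# Collect the two groups of every match found in one sample value.
matches = [m.groups() for m in pattern.finditer("France6")]
print(matches)  # [('France', '6'), ('', '')]

# The loop in the question keeps only the last assignment,
# which comes from the empty match at the end of the string.
last_name = matches[-1][0]
print(repr(last_name))  # ''
```

The first match is the expected one; the second, empty match is what ends up in the column.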

Instead, you could strip off the numbers directly:

df["Country"] = df["Country"].str.rstrip("0123456789")

Using a dedicated pandas string method will almost certainly be much faster than a plain Python loop over the rows, because it operates on the whole column at once.
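As a rough sketch of this approach on a small, made-up DataFrame (the sample values are assumptions, not data from the question), the trailing digits can be stripped with str.rstrip. If the numbers are still needed, str.extract can pull them out first; the "Number" column name is hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the real column presumably has many more rows.
df = pd.DataFrame({"Country": ["France6", "Spain", "Italy12"]})

# Optionally keep the trailing digits in their own column (NaN where there are none).
df["Number"] = df["Country"].str.extract(r"(\d+)$", expand=False)

# Remove any run of digits from the right-hand end of each value.
df["Country"] = df["Country"].str.rstrip("0123456789")

print(df["Country"].tolist())  # ['France', 'Spain', 'Italy']
```

Note that rstrip only touches the end of the string, so digits in the middle of a name would be left alone.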

CodePudding user response:

Add a beginning and ending match like this:

^([a-zA-Z]*)(\d*)$

This forces the pattern to match the entire string, so the extra zero-length match at the end of the string can no longer occur. Perhaps that was the problem.
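A small sketch of the anchored pattern in plain Python follows. The helper name split_country and its fallback behavior (returning the value unchanged when the pattern does not match, for example when the name contains a space) are assumptions made for illustration.

```python
import re

# Anchored version of the pattern from the question.
ANCHORED = re.compile(r"^([a-zA-Z]*)(\d*)$")

def split_country(value):
    """Return (name, number) for values like 'France6'.

    Hypothetical helper: values that do not fit the pattern
    are returned unchanged with an empty number.
    """
    m = ANCHORED.match(value)
    if m is None:
        return value, ""
    return m.group(1), m.group(2)

# With the anchors, finditer yields a single match for this input.
match_count = len(list(ANCHORED.finditer("France6")))

print(split_country("France6"))  # ('France', '6')
print(split_country("France"))   # ('France', '')
print(match_count)               # 1
```

Because [a-zA-Z] excludes spaces and accented letters, names such as "South Korea3" fall through to the fallback; widening the character class would be needed to handle them.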

If that doesn't work, try logging the regex result. Maybe your inputs are faulty.
