I have to clean a "Country" column of a DataFrame where the country names are sometimes followed by numbers (for example, "France6" instead of "France"). I would like to separate the country name from the number that follows it.
I coded this function to solve the problem:
def new_name2(row):
    for item in re.finditer("([a-zA-Z]*)(\d*)",row.Country):
        row.Country=item.group(1)
    return row
We can see that I created two groups, the first one to catch the country name, and the other to separate the number. Following that, I should get (France)(6).
Unfortunately, when I run it, my Country column ends up empty. This means that the first group I get is not "France" but "", and I don't understand why, because on a regex testing website I can see that my expression ([a-zA-Z]*)(\d*) is working.
CodePudding user response:
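Your loop overwrites row.Country on every match, including zero-length ones! Because both groups use *, the pattern can match the empty string, so for "France6" re.finditer first yields "France6" and then an empty match at the end of the string. That last match has "" in group 1, and it is the value your loop keeps.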
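To see the zero-length match for yourself, a small sketch like the following lists every match that re.finditer yields for the sample value "France6" from the question. The names pattern, matches and last_name are only illustrative.

```python
import re

# Same pattern as in the question, written as a raw string.
pattern = re.compile(r"([a-zA-Z]*)(\d*)")

# Collect (group 1, group 2) for every match that finditer yields.
matches = [(m.group(1), m.group(2)) for m in pattern.finditer("France6")]
print(matches)  # [('France', '6'), ('', '')]

# The loop in the question keeps overwriting the value, so it ends up
# with group 1 of the last match, which is the empty one.
last_name = matches[-1][0]
print(repr(last_name))  # ''
```

The second tuple is the empty match found at the end of the string, after "France6" has already been consumed.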
Instead, you could strip the trailing digits off directly:
df["Country"] = df["Country"].str.rstrip("0123456789")
Using a dedicated pandas string method will almost certainly be much faster than a plain Python loop, because it operates on the whole column at once.
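As a rough sketch of how this looks on a whole column, the example below builds a small DataFrame with made-up values (the sample data and the column names "name" and "number" are assumptions, not part of the question). It also shows Series.str.extract as one possible way to keep the number as well as the name, in case both parts are needed.

```python
import pandas as pd

# Made-up sample data; only the "Country" column name comes from the question.
df = pd.DataFrame({"Country": ["France6", "Spain", "Italy12"]})

# Option 1: drop any trailing digits and keep only the name.
stripped = df["Country"].str.rstrip("0123456789")

# Option 2: split each value into a name part and a number part.
# Named groups become the column names of the resulting DataFrame.
parts = df["Country"].str.extract(r"^(?P<name>[a-zA-Z]*)(?P<number>\d*)$")

print(stripped.tolist())
print(parts)
```

Note that the character class [a-zA-Z] does not cover spaces or accented letters, so with the anchored pattern a value such as "New Zealand3" would not match and str.extract would return missing values for that row. str.rstrip does not have that limitation.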
CodePudding user response:
Add start-of-string and end-of-string anchors, like this:
^([a-zA-Z]*)(\d*)$
This forces the pattern to match the entire string, so the extra zero-length match at the end can no longer occur. Perhaps that was the problem.
If that doesn't work, try logging the regex result. Maybe your inputs are faulty.
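Here is a minimal sketch of the anchored pattern applied to single strings. The helper name split_country and the choice to return the input unchanged when nothing matches are assumptions made for illustration.

```python
import re

# Anchored version of the pattern from the question.
PATTERN = re.compile(r"^([a-zA-Z]*)(\d*)$")

def split_country(value):
    """Return (name, number) for a value such as 'France6'."""
    m = PATTERN.match(value)
    if m is None:
        # The value contains something other than letters followed by
        # digits (a space, for example); leave it untouched.
        return value, ""
    return m.group(1), m.group(2)

print(split_country("France6"))       # ('France', '6')
print(split_country("France"))        # ('France', '')
print(split_country("New Zealand3"))  # ('New Zealand3', '')
```

Printing the result for a few real values from the column, as above, is also a quick way to check whether unexpected inputs are the cause.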