Function to remove a part of a string before a capital letter in Pandas Series

Time:11-26

I have a dataframe with a column ['locality_name'] that contains names of villages, towns, and cities. Some names are written like "town of Hamilton", some like "Hamilton", some like "city of Hamilton", and so on. This makes it hard to count unique values. My goal is to keep only the names.

I want to write a function that removes the part of a string before the first capital letter and then apply it to my dataframe.

That's what I tried:

import re

def my_slicer(row):
    """Returns a string with the name of locality."""
    return re.sub('ABCDEFGHIKLMNOPQRSTVXYZ', '', row['locality_name'])

raw_data['locality_name_only'] = raw_data.apply(my_slicer, axis=1)

I expected it to return a new column with the names of places. Instead, nothing changed: ['locality_name_only'] has the same values as ['locality_name'].

CodePudding user response:

You can use pandas.Series.str.extract. For example:

import pandas as pd

ser = pd.Series(["town of Hamilton", "Hamilton", "city of Hamilton"])
ser_2 = ser.str.extract(r"([A-Z][a-z]+-?\w+)")

In your case, use:

raw_data['locality_name_only'] = raw_data['locality_name'].str.extract(r"([A-Z][a-z]+-?\w+)")

# Output :

print(ser_2)

          0
0  Hamilton
1  Hamilton
2  Hamilton
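
As a self-contained sketch of the same idea, the snippet below builds a small stand-in for raw_data (the sample rows are an assumption, taken from the names in the question) and passes expand=False, so that str.extract returns a Series rather than a one-column DataFrame:

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's raw_data frame.
raw_data = pd.DataFrame(
    {"locality_name": ["town of Hamilton", "Hamilton", "city of Hamilton"]}
)

# expand=False returns a Series instead of a one-column DataFrame,
# which is convenient when assigning to a single new column.
raw_data["locality_name_only"] = raw_data["locality_name"].str.extract(
    r"([A-Z][a-z]+-?\w+)", expand=False
)

print(raw_data["locality_name_only"].tolist())
# ['Hamilton', 'Hamilton', 'Hamilton']
```

Note that this pattern captures only the first capitalised word, so a multi-word name such as "city of New York" would come out as "New".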

CodePudding user response:

I would use str.replace and phrase the problem as removing every word that starts with a lowercase letter:

raw_data["locality_name_only"] = raw_data["locality_name"].str.replace(r'\s*\b[a-z]\w*\s*', ' ', regex=True).str.strip()
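
A minimal sketch for trying this out, assuming a small hypothetical raw_data frame (the multi-word row is made up for illustration). It also shows a variant that is not part of this answer: deleting everything before the first capital letter with the pattern ^[^A-Z]*, which is closer to how the question was originally phrased.

```python
import pandas as pd

# Hypothetical sample data; the last row is an assumed multi-word name.
raw_data = pd.DataFrame(
    {"locality_name": ["town of Hamilton", "Hamilton", "city of New Hamilton"]}
)

# Approach from this answer: replace each word that starts with a
# lowercase letter (plus surrounding whitespace) with a single space,
# then trim the leftover leading/trailing spaces.
by_words = (
    raw_data["locality_name"]
    .str.replace(r"\s*\b[a-z]\w*\s*", " ", regex=True)
    .str.strip()
)

# Variant (an assumption, not from the answer): drop the prefix that
# precedes the first capital letter and leave the rest untouched.
by_prefix = raw_data["locality_name"].str.replace(r"^[^A-Z]*", "", regex=True)

print(by_words.tolist())
print(by_prefix.tolist())
```

The two differ when a lowercase word sits in the middle of a name: for "Stratford upon Avon" the first approach gives "Stratford Avon", while the prefix variant leaves the name as it is.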