I have a pandas data frame data
with several columns. One of these columns is GEN
. This column contains german cities as strings. Some of these cities are in a bad format, meaning that they have values like "Frankfurt a.Main"
. For every element in data['GEN']
I would like to replace every expression of the form "\.[A-ZÄÖÜ]"
(i.e. dot followed by upper case letter) by the corresponding expression "\.\b[A-ZÄÖÜ]"
. For example
"Frankfurt a.Main"
becomes"Frankfurt a. Main"
"Frankfurt a.d.Oder"
becomes"Frankfurt a.d. Oder"
and so on.
I am pretty sure that pandas.Series.str.contains
and pandas.Series.str.replace
are helpful here, but one of my problems is that I don't know how to put the replacement task in a form that can be used by the above functions.
CodePudding user response:
You can use pandas.Series.str.replace
to capture the two groups that compose a german city name in your original data and then add a whitespace between them.
Try this :
data['GEN'] = data['GEN'].str.replace(r'(\w \s.*\.)(\w*)', r'\1 \2', regex=True)
# Output :
0 Frankfurt a. Main
1 Frankfurt a.d. Oder
CodePudding user response:
You could assert a dot to the left using a positive lookbehind (?<=\.)
and match one of [A-ZÄÖÜ]
In the replacement use a space followed by the full match using \g<0>
import pandas as pd
pattern = r"(?<=\.)[A-ZÄÖÜ]"
items = [
"Frankfurt a.Main",
"Frankfurt a.d.Oder"
]
data = pd.DataFrame(items, columns=["GEN"])
data['GEN'] = data['GEN'].str.replace(pattern, r' \g<0>')
print(data)
Output
GEN
0 Frankfurt a. Main
1 Frankfurt a.d. Oder