i have this dataframe, and i want to extract cities in a separate column. You can also see, that the format is not same, and the city can be anywhere in the row. How can i extract only cities i a new column? Prompt. Here we are talking about German cities. May be to find a dictionary, that shows all German cities and somehow compare with my dataset?
Here is dictionary of german cities: https://gist.github.com/embayer/772c442419999fa52ca1
Dataframe
Adresse
0 Karlstr 10, 10 B, 30,; 04916 **Hamburg**
1 **München** Dorfstr. 28-55, 22555
2 Marnstraße. Berlin 12, 45666 **Berlin**
3 Musterstr, 24855 **Dresden**
... ...
850 Muster Hausweg 11, **Hannover**, 56668
851 Mariestr. 4, 48669 **Nürnberg**
852 **Hilden** Weederstr 33-55, 56889
853 Pt-gaanen-Str. 2, 45883 **Potsdam**
Output
Cities
0 Hamburg
1 München
2 Berlin
3 Dresden
... ...
850 Hannover
851 Nürnberg
852 Hilden
853 Potsdam
CodePudding user response:
You can use regular expression to extract the city names, as they are indicated by **
:
import re
import pandas
df = pd.DataFrame({"Adresse": ["Karlstr 10, 10 B, 30,; 04916 **Hamburg**", "**München** Dorfstr. 28-55, 22555", "Marnstraße. Berlin 12, 45666 **Berlin**", "Musterstr, 24855 **Dresden**"]})
df['Cities'] = [re.findall(r".*\*\*(.*)\*\*", address)[0] for address in df['Adresse']]
This results in:
df
Adresse Cities
0 Karlstr 10, 10 B, 30,; 04916 **Hamburg** Hamburg
1 **München** Dorfstr. 28-55, 22555 München
2 Marnstraße. Berlin 12, 45666 **Berlin** Berlin
3 Musterstr, 24855 **Dresden** Dresden
CodePudding user response:
You could extract in a list all the cities from the dictionary you provided ( I asssume it's the 'stadt' key
), and then use str.findall
in your column:
cities_ = [cities[n]['stadt'] for n in range(0,len(cities))]
df.Adresse.str.findall(r'|'.join(cities_))
>>>
0 [Karlstr, Hamburg]
1 []
2 []
3 []
4 []
5 []
6 []
7 []
8 []
Name: Adresse, dtype: object
CodePudding user response:
You can simply use str.extract
since all the names are between couple of stars.
df["cities"] = df["Adress"].str.extract(r'\*\*(\w )\*\*')