A dataframe I am working on contains an address line, from which I wish to extract the different variables (street name, number, postal code and country). The address line is filled according to the following template: "street number, postal code city, country".
Using .str.split(), I can get individual elements from the string. Returning multiple elements, however, does not work.
So, extracting the country is easy enough. But when I try to get the street and number, I run into trouble. The street name can contain one or more spaces, but the number always comes after the last space. So I try this:
df["street_and_number"]=df["address"].str.split(",").str[0]
which creates a column with the street and the number. So far so good.
If I want to get the street, splitting on whitespaces and extracting everything but the last element seems (to me) a pretty straightforward option:
df["street"]=df["street_and_number"].str.split().str[:-1]
But this returns a list, see example below:
1541 [Burgemeester, Roelenweg]
Name: straat_en_nr, dtype: object
I would like to get a string, instead of a list with all the elements. Using the expand option doesn't produce the required result either:
df["street"]=df["street_and_number"].str.split(expand=True)[:-1]
Output:
0 1 2
What am I doing wrong?
CodePudding user response:
Add Series.str.join to turn each list back into a string:
df["street"] = df["street_and_number"].str.split().str[:-1].str.join(' ')
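As a minimal sketch of the whole chain (the sample addresses are made up for illustration, and the "number" column is my own addition): .str[:-1] slices each list of words, and .str.join(" ") glues what is left back into a single string.

```python
import pandas as pd

# Hypothetical sample data following the
# "street number, postal code city, country" template.
df = pd.DataFrame({"address": [
    "Burgemeester Roelenweg 12, 8021 EV Zwolle, Netherlands",
    "Main Street 5, 12345 Springfield, USA",
]})

# Everything before the first comma is "street number".
df["street_and_number"] = df["address"].str.split(",").str[0]

# Split on whitespace once, then reuse the resulting lists.
words = df["street_and_number"].str.split()
df["street"] = words.str[:-1].str.join(" ")  # all words but the last, as a string
df["number"] = words.str[-1]                 # the last word

print(df[["street", "number"]])
```

The expand=True attempt in the question behaves differently because [:-1] is then applied to the resulting DataFrame, where it drops the last row rather than the last word of each row.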
CodePudding user response:
If what you want is to extract the beginning of the string up to the first comma, except the last word, I believe a one step regex would be appropriate:
df['street'] = df['address'].str.extract(r'^([^,]+)\s+\S+,')
Example:
                                         address       street
0  street name number, postal code city, country  street name
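A quick way to check this pattern, again on made-up addresses. I pass expand=False here (my own choice, not required) so that str.extract returns a Series instead of a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data; only the part before the first comma matters.
df = pd.DataFrame({"address": [
    "street name number, postal code city, country",
    "Burgemeester Roelenweg 12, 8021 EV Zwolle, Netherlands",
]})

# ^([^,]+)  capture as much as possible without crossing a comma
# \s+\S+,   then whitespace, one last word, and the first comma
df["street"] = df["address"].str.extract(r"^([^,]+)\s+\S+,", expand=False)

print(df["street"])
```

The capture group is greedy, but it has to give back the last word so that the rest of the pattern can match, which is what removes the number.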
Or you can match on digits:
df['street'] = df['address'].str.extract(r'^([^,\d]+)(?:\s+\d+)?,')
Example:
                                       address            street
0   street name 123, postal code city, country       street name
1  street name only, postal code city, country  street name only
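If you also want the number, one possible variation is to turn it into a second, named group. This is only a sketch: the group names "street" and "number" and the sample rows are my own assumptions.

```python
import pandas as pd

# Hypothetical sample data, including one street name that contains a digit.
df = pd.DataFrame({"address": [
    "street name 123, postal code city, country",
    "street name only, postal code city, country",
    "2nd Avenue 5, postal code city, country",
]})

# Named groups become the column names of the DataFrame returned by str.extract.
pattern = r"^(?P<street>[^,\d]+)(?:\s+(?P<number>\d+))?,"
parts = df["address"].str.extract(pattern)

df["street"] = parts["street"]
df["number"] = parts["number"]  # NaN where there is no trailing number

print(df[["street", "number"]])
```

Note that excluding digits from the street group means a street name that itself contains a digit (the third row) does not match at all and yields NaN, so the whitespace-based pattern may be the safer choice for such data.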