Home > Software design >  Pandas: splitting a string and returning multiple elements does not work - only one element is retur
Pandas: splitting a string and returning multiple elements does not work - only one element is retur

Time:11-10

A dataframe I am working on contains an address line, from which I wish to extract the different variables (street name, number, postal code and country). The address line is filled according to the following template: "street number, postal code city, country".

Using .str.split(), I can get invividual elements from the string. Returning multiple elements however does not work.

So, extracting the country is easy enough. But when I try to get the street and number I run into some troubles. The street can contain one or more whitespaces, but after the last whitespace there is always the number. So I try to do this:

df["street_and_number"]=df["address"].str.split(",").str[0]

which creates a column with the street and the number. So far so good.

If I want to get the street, splitting on whitespaces and extracting everything but the last element seems (to me) a pretty straightforward option:

df["street"]=df["street_and_number"].str.split().str[:-1]

But this returns a list, see example below:

1541    [Burgemeester, Roelenweg]
Name: straat_en_nr, dtype: object

I would like to get a string, instead of a list with all the elements. Using the expand option doesn't produce the required result either:

df["street"]=df["street_and_number"].str.split(expand=True)[:-1]

Output:


    0   1   2

What am I doing wrong?

CodePudding user response:

Add Series.str.join:

df["street"] = df["street_and_number"].str.split().str[:-1].str.join(' ')

CodePudding user response:

If what you want is to extract the beginning of the string up to the first comma, except the last word, I believe a one step regex would be appropriate:

df['street'] = df['address'].str.extract(r'^([^,] )\s \S ,')

Example:

                                         address       street
0  street name number, postal code city, country  street name

regex demo

Or you can match on digits:

df['street'] = df['address'].str.extract(r'^([^,\d] )(?:\s \d )?,')

Example:

                                       address            street
0   street name 123, postal code city, country       street name 
1  street name only, postal code city, country  street name only

regex demo

  • Related