Home > front end >  Can I use regex to split pandas dataframe column on first match only?
Can I use regex to split pandas dataframe column on first match only?

Time:10-25

A column in my dataframe has the following campaign contribution data formatted in one of two ways:

JOHN A. DONOR1234 W ROAD ST CITY, STATE 56789

And

JANE M. DONOR
1234 W ROAD ST
CITY, STATE 56789

I want to split this column into two. Column one should be the name of the donor. Column two should be the address.

Currently, I'm using the following regex code to try and accomplish this:

url = ("http://www.voterfocus.com/CampaignFinance/candidate_pr.php?op=rp&e=8&c=munmiamibeach&ca=64&sdc=116&rellevel=4&dhc=774&committee=N")
dfs = pd.read_html(url)
df = dfs[0]
df['Contributor'].str.split(r'\d\d?', expand=True)

But instead of splitting after the first match and quitting - as I intend - the regex seems continue matching and splitting. My output should looke like this:

Col1                    Col2
JOHN A. DONOR          1234 W ROAD ST CITY, STATE 56789

CodePudding user response:

it may be much simpler than that. You can use the string methods. For example, I think this is the behavior you want:

import pandas as pd
s = """JOHN A. DONOR
1234 W ROAD ST
CITY, STATE 56789"""

df = pd.DataFrame([s], columns=["donors"])
df.donors.str.split("\n", 1, expand=True)

output:

               0                                  1
0  JOHN A. DONOR  1234 W ROAD ST\nCITY, STATE 56789

CodePudding user response:

Splitting solution

You can use

df['Contributor'].str.split(r'(?<=\D)(?=\d)', expand=True, n=1)

The (?<=\D)(?=\d) regex finds a location between a non-digit char (\D) and a digit char (\d), splits the string there and only performs this operation once (due to n=1).

Alternative solution

You can match and capture the names up to the first number, and then capture all text there remains after and including the first digit using

df['Contributor'].str.extract(r'(?P<Name>\D*)(?P<Address>\d.*)', expand=True)
# =>                                            Name                                         #   Address
#   0                  Contributor  CHRISTIAN ULVERT                1742 W FLAGLER STMIAMI, FL 33135
#   1                     Contributor  Roger Thomson               4271 Alton Miami Beach , FL 33140
#   2                Contributor  Steven Silverstein          691 West 247th Street Bronx , NY 10471
#   3                      Contributor  Cathy Raduns           691 West 247th Street Bronx, NY 10471
#   4          Contributor  Asher Raduns-Silverstein            691 West 247th StreetBRONX, NY 10471

The (?P<Name>\D*)(?P<Address>\d.*) pattern means

  • (?P<Name>\D*) - Group "Name": zero or more chars other than digits
  • (?P<Address>\d.*) - Group "Address": a digit and then any zero or more chars other than line break chars.

If there are line breaks in the string, add (?s) at the start of the pattern, i.e. r'(?s)(?P<Name>\D*)(?P<Address>\d.*)'.

See the regex demo.

  • Related