A column in my dataframe has the following campaign contribution data formatted in one of two ways:
JOHN A. DONOR1234 W ROAD ST CITY, STATE 56789
And
JANE M. DONOR
1234 W ROAD ST
CITY, STATE 56789
I want to split this column into two. Column one should be the name of the donor. Column two should be the address.
Currently, I'm using the following regex
code to try and accomplish this:
url = ("http://www.voterfocus.com/CampaignFinance/candidate_pr.php?op=rp&e=8&c=munmiamibeach&ca=64&sdc=116&rellevel=4&dhc=774&committee=N")
dfs = pd.read_html(url)
df = dfs[0]
df['Contributor'].str.split(r'\d\d?', expand=True)
But instead of splitting after the first match and quitting - as I intend - the regex seems continue matching and splitting. My output should looke like this:
Col1 Col2
JOHN A. DONOR 1234 W ROAD ST CITY, STATE 56789
CodePudding user response:
it may be much simpler than that. You can use the string methods. For example, I think this is the behavior you want:
import pandas as pd
s = """JOHN A. DONOR
1234 W ROAD ST
CITY, STATE 56789"""
df = pd.DataFrame([s], columns=["donors"])
df.donors.str.split("\n", 1, expand=True)
output:
0 1
0 JOHN A. DONOR 1234 W ROAD ST\nCITY, STATE 56789
CodePudding user response:
Splitting solution
You can use
df['Contributor'].str.split(r'(?<=\D)(?=\d)', expand=True, n=1)
The (?<=\D)(?=\d)
regex finds a location between a non-digit char (\D
) and a digit char (\d
), splits the string there and only performs this operation once (due to n=1
).
Alternative solution
You can match and capture the names up to the first number, and then capture all text there remains after and including the first digit using
df['Contributor'].str.extract(r'(?P<Name>\D*)(?P<Address>\d.*)', expand=True)
# => Name # Address
# 0 Contributor CHRISTIAN ULVERT 1742 W FLAGLER STMIAMI, FL 33135
# 1 Contributor Roger Thomson 4271 Alton Miami Beach , FL 33140
# 2 Contributor Steven Silverstein 691 West 247th Street Bronx , NY 10471
# 3 Contributor Cathy Raduns 691 West 247th Street Bronx, NY 10471
# 4 Contributor Asher Raduns-Silverstein 691 West 247th StreetBRONX, NY 10471
The (?P<Name>\D*)(?P<Address>\d.*)
pattern means
(?P<Name>\D*)
- Group "Name": zero or more chars other than digits(?P<Address>\d.*)
- Group "Address": a digit and then any zero or more chars other than line break chars.
If there are line breaks in the string, add (?s)
at the start of the pattern, i.e. r'(?s)(?P<Name>\D*)(?P<Address>\d.*)'
.
See the regex demo.