I have a rough df extracted from a PDF data table that is composed of a single column containing strings. Each entry in the column is in the following form:
Mayor ............... Paul Jones
Senator ................. Billy Twister
Congress Rep. .......... Chris Rock
Chief of Staff ....... Tony Allen
I would like to use string.split(pat=".")
to split these strings into two columns for Title
and Name
using the period strings. However, each entry in the column has different lengths for the period string delimiter. As expected, if I use pat="."
or pat="......."
, for instance, I end up with inconsistent splits.
My goal is to implement a way for the delimiter to be able to handle a range of period string lengths for a given minimum and maximum length. I have googled around with this question and have not been able to find a direct answer. Is this possible to implement?
CodePudding user response:
Use Series.str.split
with the regex \s \. \s
, which splits by 1 spaces, 1 periods, 1 spaces:
df = pd.DataFrame({'A': ['Mayor ............... Paul Jones', 'Senator ................. Billy Twister', 'Congress Rep. .......... Chris Rock', 'Chief of Staff ....... Tony Allen']})
df[['Title', 'Name']] = df['A'].str.split('\s \. \s ', expand=True)
# A Title Name
# 0 Mayor ............... Paul Jones Mayor Paul Jones
# 1 Senator ................. Billy Twister Senator Billy Twister
# 2 Congress Rep. .......... Chris Rock Congress Rep. Chris Rock
# 3 Chief of Staff ....... Tony Allen Chief of Staff Tony Allen