Home > Software engineering >  How to split a string column into two columns with a 'variable' delimiter?
How to split a string column into two columns with a 'variable' delimiter?

Time:11-28

I have a rough df extracted from a PDF data table that is composed of a single column containing strings. Each entry in the column is in the following form:

Mayor ............... Paul Jones
Senator ................. Billy Twister
Congress Rep. .......... Chris Rock
Chief of Staff ....... Tony Allen

I would like to use string.split(pat=".") to split these strings into two columns for Title and Name using the period strings. However, each entry in the column has different lengths for the period string delimiter. As expected, if I use pat="." or pat=".......", for instance, I end up with inconsistent splits.

My goal is to implement a way for the delimiter to be able to handle a range of period string lengths for a given minimum and maximum length. I have googled around with this question and have not been able to find a direct answer. Is this possible to implement?

CodePudding user response:

Use Series.str.split with the regex \s \. \s , which splits by 1 spaces, 1 periods, 1 spaces:

df = pd.DataFrame({'A': ['Mayor ............... Paul Jones', 'Senator ................. Billy Twister', 'Congress Rep. .......... Chris Rock', 'Chief of Staff ....... Tony Allen']})

df[['Title', 'Name']] = df['A'].str.split('\s \. \s ', expand=True)

#                                          A           Title           Name
# 0         Mayor ............... Paul Jones           Mayor     Paul Jones
# 1  Senator ................. Billy Twister         Senator  Billy Twister
# 2      Congress Rep. .......... Chris Rock   Congress Rep.     Chris Rock
# 3        Chief of Staff ....... Tony Allen  Chief of Staff     Tony Allen
  • Related