Pandas splitting a column with new line separator

Time:04-12

I am extracting tables from a PDF using Camelot. Two of the columns are getting merged into one, with a newline as the separator. Is there a way to split them into two columns? Suppose the column looks like this:

A\nB
1\n2
2\n3
3\n4

Desired output:

|A|B|
|-|-|
|1|2|
|2|3|
|3|4|

I have tried df['A\nB'].str.split('\n', 2, expand=True), which splits the column in two, but I want the new columns to be named A and B rather than 0 and 1. I also need to pass a generalized column label instead of the actual column name, because I have to do this for several documents that may have different column names. I can find such a column in my dataframe using colNew = df.columns[df.columns.str.contains(pat = '\n')]

However, when I pass colNew to the split function, it raises an AttributeError:

df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'

CodePudding user response:

You can use the pandas Series.str.split method.

import pandas as pd
# Recreate the DataFrame from the question.
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})

# First, cast the column to str.
# Second, split it on the separator '\n'.
# Third, pass expand=True so the pieces become separate columns.
test = df['A\nB'].astype('str').str.split('\n',expand=True)

# Rename the new columns.
test.columns = ['A','B']

I hope this is helpful.
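To avoid hard-coding the new names, they can be taken from the merged header itself. The following is only a minimal sketch of that idea. The extra `id` column and the variable names `merged`, `parts` and `result` are made up for illustration and are not part of the question.

```python
import pandas as pd

# Hypothetical frame: 'id' is an invented extra column; 'A\nB' is the merged one.
df = pd.DataFrame({'id': [10, 20, 30], 'A\nB': ['1\n2', '2\n3', '3\n4']})

merged = 'A\nB'  # label of the merged column

# Split the cell values on the newline into separate columns.
parts = df[merged].astype(str).str.split('\n', expand=True)

# Split the header the same way to get the new column names: ['A', 'B'].
parts.columns = merged.split('\n')

# Drop the merged column and put the split columns next to the remaining ones.
result = pd.concat([df.drop(columns=[merged]), parts], axis=1)
print(result)
```

The split values are still strings, so convert them (for example with pd.to_numeric) if numbers are needed.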

CodePudding user response:

I reproduced the error on my side. The issue is that colNew is an Index (a list-like of labels), so df[colNew] returns a DataFrame even when it selects a single column, and the .str accessor is only available on a Series. Taking your code as an example, I would convert the DataFrame to a Series using iloc[:,0].

Then add another line that splits the column header to name the new columns:

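# Pass n by keyword; recent pandas versions reject it as a positional argument.
df2 = df[colNew].iloc[:, 0].str.split('\n', n=2, expand=True)
# Take the new names from the merged header itself instead of hard-coding 'A\nB'.
df2.columns = colNew[0].split('\n')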
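Since the question asks for something that works across documents with different headers, the same steps could be wrapped in a helper that handles every column whose label contains the separator. This is a sketch under assumptions, not a definitive implementation: the function name `split_merged_columns` and the sample column `C` are hypothetical, and it assumes column labels are unique.

```python
import pandas as pd

def split_merged_columns(df, sep='\n'):
    """Return a new frame in which every column whose label contains
    `sep` is replaced by one column per piece of that label."""
    pieces = []
    for col in df.columns:
        if isinstance(col, str) and sep in col:
            names = col.split(sep)
            # Cap the number of splits so a cell never yields more pieces
            # than the header has names. Note that astype(str) turns
            # missing values into the text 'nan'.
            parts = df[col].astype(str).str.split(
                sep, n=len(names) - 1, expand=True
            )
            # Pad with empty columns if every cell had fewer pieces than names.
            parts = parts.reindex(columns=range(len(names)))
            parts.columns = names
            pieces.append(parts)
        else:
            # Keep untouched columns as single-column frames.
            pieces.append(df[[col]])
    return pd.concat(pieces, axis=1)

# Example with the data from the question plus an invented column 'C'.
df = pd.DataFrame({'A\nB': ['1\n2', '2\n3', '3\n4'], 'C': ['x', 'y', 'z']})
out = split_merged_columns(df)
print(out)
```

The helper keeps the original column order, so the split columns appear where the merged column used to be.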