I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns? Suppose the column looks like this.
A\nB |
---|
1\n2 |
2\n3 |
3\n4 |
Desired output: |A|B| |-|-| |1|2| |2|3| |3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns however I want the new column names to be A and B and not 0 and 1. Also I need to pass a generalized column label instead of actual column name since I need to implement this for several docs which may have different column names. I can determine such column name in my dataframe using colNew = df.columns[df.columns.str.contains(pat = '\n')]
However when I pass colNew in split function, it throws an attribute error df[colNew].str.split('\n', 2, expand=True) AttributeError: DataFrame object has no attribute 'str'
CodePudding user response:
You can take advantage of the Pandas split function.
import pandas as pd
# recreate your pandas series above.
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})
# first: Turn the col into str.
# second. split the col based on seperator \n
# third: make sure expand as True since you want the after split col become two new col
test = df['A\nB'].astype('str').str.split('\n',expand=True)
# some rename
test.columns = ['A','B']
I hope this is helpful.
CodePudding user response:
I reproduced the error from my side... I guess the issue is that "df[colNew]" is still a dataframe as it contains the indexes. But .str.split() only works on Series. So taking as example your code, I would convert the dataframe to series using iloc[:,0].
Then another line to split the column headers:
df2=df[colNew].iloc[:,0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')