This is a bit of a weird one -
I have scraped some data out of a pdf to create a table using tabula.
The table doesn't come out perfect and it merges some of the columns together. As an example 3 of the columns all get put into 1 column. No problem to split that with a delimiter, I can do something like -
columns_split = 'A B C'
df_split = df[columns_split].str.split(' ', expand=True)
The problem - one of the columns that gets merged has values that may go up to 4 digits. For some reason the values in the pdf table that are 4 digits long have a space in them so instead of '1000' the values are '1 000'. This consequently gets honoured by tabula when it reads the pdf. The merged column that comes in looks a bit like this; where instead of three separate columns A, B and C they are all ploughed in to one.
df['A B C']
Out[24]:
1 A B C
2 1 000 60,0 300,0
3 1 000 60,0 200,0
4 999 60,0 200,0
5 888 70,0 100,0
6 2 777 80,0 100,0
7 666 80,0 90,0
8 555 50,0 100,0
9 111 50,0 100,0
Name: A B C, dtype: object
The knock on effect is when I run the split command it will split the columns where it finds a space in those values and so I end up with 4 columns with some values being knocked along. My initial thought was to ignore the first first delimiter, but that wouldn't apply to the rows that don't have the issue.
columns_split = 'A B C'
df_split = df[columns_split].str.split(' ', expand=True)
df_split
Out[26]:
0 1 2 3
1 A B C None
2 1 000 60,0 300,0
3 1 000 60,0 200,0
4 999 60,0 200,0 None
5 888 70,0 100,0 None
6 2 777 80,0 100,0
7 666 80,0 90,0 None
8 555 50,0 100,0 None
9 111 50,0 100,0 None
Hopefully someone out there has an inspirational idea how to get around this. Editing the pdf sadly is not an option, the above is for illustrative purposed and alas this is not the only pdf I need to run this on.
Thanks in advance for any help.
CodePudding user response:
If only the first column has those incorrect spaces, use str.rsplit
with a max of 2 splits:
df['A B C'].str.rsplit(n=2, expand=True)
Output:
0 1 2
1 A B C
2 1 000 60,0 300,0
3 1 000 60,0 200,0
4 999 60,0 200,0
5 888 70,0 100,0
6 2 777 80,0 100,0
7 666 80,0 90,0
8 555 50,0 100,0
9 111 50,0 100,0