I am writing a script to scrape a series of tables from a PDF into Python using tabula-py.
This works, and I do get the data. But each record is spread over multiple rows, which makes it useless in practice.
I would like to merge each run of continuation rows (where the first column, Tag, is NaN) into the preceding row where Tag is not NaN.
I was about to loop over the whole thing and do it manually, but I realize that pandas is a powerful tool; I just don't have the pandas vocabulary to search for the right function. Any help is much appreciated.
My code:
import tabula

filename = 'tags.pdf'
tagTableStart = 2  #784
tagTableEnd = 39  #822
tableHeadings = ['Tag', 'Item', 'Length', 'Description', 'Value']
pageRange = "%d-%d" % (tagTableStart, tagTableEnd)
print("Scanning pages %s" % pageRange)
# extract all the tables in that page range
tables = tabula.read_pdf(filename, pages=pageRange)
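As far as I know, recent tabula-py versions return a list of DataFrames from `read_pdf`, one per detected table, so the pieces need to be stacked before any row merging. The following is only a sketch of one way to do that: `combine_tables` is a hypothetical helper (not part of tabula-py), and it assumes every table of interest has exactly as many columns as `tableHeadings`. Note that it overwrites whatever header tabula inferred for each table.

```python
import pandas as pd

def combine_tables(tables, headings):
    """Stack a list of DataFrames into one frame with uniform column names.

    Hypothetical helper: assumes each relevant table has len(headings) columns.
    """
    frames = []
    for table in tables:
        if table.shape[1] != len(headings):
            continue  # skip tables whose column count does not match
        # Replace the inferred header with the expected column names.
        frames.append(table.set_axis(headings, axis=1))
    if not frames:
        return pd.DataFrame(columns=headings)
    return pd.concat(frames, ignore_index=True)

# Stand-in data for what tabula might return (two pages plus an unrelated table).
headings = ['Tag', 'Item', 'Length', 'Description', 'Value']
page_a = pd.DataFrame([['AA', 'Some', '2', 'Very Very', None]], columns=list('abcde'))
page_b = pd.DataFrame([['AB', 'More', '4', 'Other Very', 'aaaa']], columns=list('vwxyz'))
unrelated = pd.DataFrame([['x', 'y', 'z']], columns=['p', 'q', 'r'])
df = combine_tables([page_a, page_b, unrelated], headings)
```

With the real data, the call would be `df = combine_tables(tables, tableHeadings)`.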
How the data is stored in the DataFrame (empty fields are NaN):
| Tag | Item | Length | Description | Value |
|---|---|---|---|---|
| AA | Some | 2 | Very Very | |
| | Text | | Very long | |
| | | | Value | |
| AB | More | 4 | Other Very | aaaa |
| | Text | | Very long | bbbb |
| | | | Value | cccc |
How I want the data:
This is almost as it is displayed in the pdf (I couldn't figure out how to make text multi line in SO editor)
Tag | Item | Length | Description | Value |
---|---|---|---|---|
AA | Some\nText | 2 | Very Very\nVery long\nValue | |
AB | More\nText | 4 | Other Very\nVery long\nValue | aaaa\nbbbb\ncccc |
Actual sample output (obfuscated)
Tag Item Length Description Value
0 AA PYTHROM-PARTY-I 20 Some Current defined values are :
1 NaN NaN NaN texst Byte1:
2 NaN NaN NaN NaN C
3 NaN NaN NaN NaN DD
4 NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN DD
6 NaN NaN NaN NaN DD
7 NaN NaN NaN NaN DD
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN B :
10 NaN NaN NaN NaN JLSAFISFLIHAJSLIhdsflhdliugdyg89o7fgyfd
11 NaN NaN NaN NaN ISFLIHAJSLIhdsflhdliugdyg89o7fgyfd
12 NaN NaN NaN NaN upon ISFLIHAJSLIhdsflhdliugdyg89o7fgy
13 NaN NaN NaN NaN asdsadct on the dasdsaf the
14 NaN NaN NaN NaN actsdfion.
15 NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN SLKJDBFDLFKJBDSFLIUFy7dfsdfiuojewv
17 NaN NaN NaN NaN csdfgfdgfd.
18 NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN fgfdgdfgsdfgfdsgdfsgfdgfdsgsdfgfdg
20 BB PRESENT-AMOUNT-BOX 11 Lorem Ipsum NaN
21 CC SOME-OTHER-VALUE 1 sdlkfgsdsfsdf 1
22 NaN NaN NaN device NaN
23 NaN NaN NaN ueghkjfgdsfdskjfhgsdfsdfkjdshfgsfliuaew8979vfhsdf NaN
24 NaN NaN NaN dshf87hsdfe4ir8hod9 NaN
CodePudding user response:
Create groups from the forward-filled Tag column, then join the rows in each group:
agg_func = dict(zip(df.columns, [lambda s: '\n'.join(s).strip()] * len(df.columns)))
out = df.fillna('').groupby(df['Tag'].ffill(), as_index=False).agg(agg_func)
Output:
>>> out
Tag Item Length Description Value
0 AA Some\nText 2 Very Very\nVery long\nValue
1 AB More\nText 4 Other Very\nVery long\nValue aaaa\nbbbb\ncccc
agg_func is equivalent to writing:
{'Tag': lambda s: '\n'.join(s).strip(),
'Item': lambda s: '\n'.join(s).strip(),
'Length': lambda s: '\n'.join(s).strip(),
'Description': lambda s: '\n'.join(s).strip(),
'Value': lambda s: '\n'.join(s).strip()}
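A variation on the same idea, offered as a sketch rather than a drop-in replacement: instead of filling NaN with empty strings, it drops NaN cells before joining (so blank rows do not leave empty lines), and it groups on a running count of non-NaN Tag values, so two separate records that happen to share a tag are not merged. `merge_continuation_rows` and `join_lines` are hypothetical names, and the sample frame assumes every cell is either a string or NaN.

```python
import numpy as np
import pandas as pd

def join_lines(cells):
    # Drop NaN cells, then join what is left with newlines.
    return "\n".join(cells.dropna().astype(str))

def merge_continuation_rows(df, key="Tag"):
    """Merge rows whose `key` is NaN into the preceding row that has a key."""
    # Each non-NaN key starts a new group; continuation rows inherit its number.
    group_id = df[key].notna().cumsum().rename("group")
    agg_func = {col: join_lines for col in df.columns}
    return df.groupby(group_id).agg(agg_func).reset_index(drop=True)

# Sample frame mirroring the layout in the question.
df = pd.DataFrame({
    "Tag": ["AA", np.nan, np.nan, "AB", np.nan, np.nan],
    "Item": ["Some", "Text", np.nan, "More", "Text", np.nan],
    "Length": ["2", np.nan, np.nan, "4", np.nan, np.nan],
    "Description": ["Very Very", "Very long", "Value",
                    "Other Very", "Very long", "Value"],
    "Value": [np.nan, np.nan, np.nan, "aaaa", "bbbb", "cccc"],
})
out = merge_continuation_rows(df)
```

If a column such as Length was parsed as float, `astype(str)` would render 2.0 as "2.0", so it may be worth converting such columns to strings beforehand.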