Merging rows in pandas DataFrame

Time:10-27

I am writing a script to scrape a series of tables from a PDF into Python using tabula.

This works, and I do get the data. But each record is spread over several rows, which makes it useless as it stands.
I would like to merge every row where the first column (Tag) is NaN into the nearest preceding row where Tag is not NaN.
I was about to put the whole thing in a loop and do it manually, but I realize that pandas is a powerful tool; I just don't have the pandas vocabulary to search for the right feature. Any help is much appreciated.

My Code

import tabula  # provided by the tabula-py package

filename='tags.pdf'
tagTableStart=2 #784
tagTableEnd=39 #822

tableHeadings = ['Tag','Item','Length','Description','Value']

pageRange = "%d-%d" % (tagTableStart, tagTableEnd)
print ("Scanning pages %s" % pageRange)

# extract all the tables in that page range
tables = tabula.read_pdf(filename, pages=pageRange)
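Note that `tables` here is most likely a list with one DataFrame per detected table, not a single DataFrame (this is my understanding of the tabula-py default, so treat it as an assumption). A minimal sketch of stacking them into one frame before any merging; the two small frames and the `AC` row are hypothetical stand-ins for real `read_pdf` output:

```python
import pandas as pd

table_headings = ["Tag", "Item", "Length", "Description", "Value"]

# Hypothetical stand-ins for what read_pdf might return for two pages.
tables = [
    pd.DataFrame(
        [["AB", "More", "4", "Other Very", "aaaa"],
         [None, "Text", None, "Very long", "bbbb"]],
        columns=table_headings,
    ),
    pd.DataFrame(
        [[None, None, None, "Value", "cccc"],
         ["AC", "Else", "1", "x", "y"]],
        columns=table_headings,
    ),
]

# Stack the per-page tables and renumber the rows 0..n-1.
df = pd.concat(tables, ignore_index=True)
print(df)
```

With a single continuous index, a record that continues onto the next page sits directly below its first row, which is what any row-merging step needs.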

How the data is stored in the DataFrame:

(Empty fields are NaN)

Tag  Item  Length  Description  Value
AA   Some  2       Very Very
     Text          Very long
                   Value
AB   More  4       Other Very   aaaa
     Text          Very long    bbbb
                   Value        cccc

How I want the data:

This is almost as it is displayed in the pdf (I couldn't figure out how to make text multi line in SO editor)

Tag  Item        Length  Description                   Value
AA   Some\nText  2       Very Very\nVery long\nValue
AB   More\nText  4       Other Very\nVery long\nValue  aaaa\nbbbb\ncccc

Actual sample output (obfuscated)

    Tag                         Item Length                                        Description                                    Value
0    AA              PYTHROM-PARTY-I     20                                               Some             Current defined values are :
1   NaN                          NaN    NaN                                              texst                                   Byte1:
2   NaN                          NaN    NaN                                                NaN                                        C
3   NaN                          NaN    NaN                                                NaN                                       DD 
4   NaN                          NaN    NaN                                                NaN                                      NaN
5   NaN                          NaN    NaN                                                NaN                                       DD
6   NaN                          NaN    NaN                                                NaN                                       DD
7   NaN                          NaN    NaN                                                NaN                                       DD
8   NaN                          NaN    NaN                                                NaN                                      NaN
9   NaN                          NaN    NaN                                                NaN                                   B    :
10  NaN                          NaN    NaN                                                NaN  JLSAFISFLIHAJSLIhdsflhdliugdyg89o7fgyfd
11  NaN                          NaN    NaN                                                NaN       ISFLIHAJSLIhdsflhdliugdyg89o7fgyfd
12  NaN                          NaN    NaN                                                NaN    upon ISFLIHAJSLIhdsflhdliugdyg89o7fgy
13  NaN                          NaN    NaN                                                NaN              asdsadct on the dasdsaf the
14  NaN                          NaN    NaN                                                NaN                               actsdfion.
15  NaN                          NaN    NaN                                                NaN                                      NaN
16  NaN                          NaN    NaN                                                NaN       SLKJDBFDLFKJBDSFLIUFy7dfsdfiuojewv
17  NaN                          NaN    NaN                                                NaN                              csdfgfdgfd.
18  NaN                          NaN    NaN                                                NaN                                      NaN
19  NaN                          NaN    NaN                                                NaN       fgfdgdfgsdfgfdsgdfsgfdgfdsgsdfgfdg
20   BB           PRESENT-AMOUNT-BOX    11                          Lorem Ipsum                                                     NaN
21   CC           SOME-OTHER-VALUE      1                                        sdlkfgsdsfsdf                                        1
22  NaN                          NaN    NaN                                             device                                      NaN
23  NaN                          NaN    NaN  ueghkjfgdsfdskjfhgsdfsdfkjdshfgsfliuaew8979vfhsdf                                      NaN
24  NaN                          NaN    NaN                                dshf87hsdfe4ir8hod9                                      NaN

Answer:

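Create groups by forward-filling the Tag column, so that each continuation row inherits the tag of the row above it, then join the rows of each group column by column: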

agg_func = dict(zip(df.columns, [lambda s: '\n'.join(s).strip()] * len(df.columns)))

out = df.fillna('').groupby(df['Tag'].ffill(), as_index=False).agg(agg_func)

Output:

>>> out
  Tag        Item Length                   Description             Value
0  AA  Some\nText      2   Very Very\nVery long\nValue                  
1  AB  More\nText      4  Other Very\nVery long\nValue  aaaa\nbbbb\ncccc
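The grouping key is the part that does the work, so it may help to look at it in isolation. The sketch below uses only the Tag column from the small example. The `record_id` variant is my own addition, not part of the answer: it numbers the records instead of reusing the tag text, which would matter only if the same tag could appear more than once in the PDF.

```python
import numpy as np
import pandas as pd

# Only the Tag column matters for building the group key.
tag = pd.Series(["AA", np.nan, np.nan, "AB", np.nan, np.nan], name="Tag")

# Forward fill copies each tag down onto the NaN rows below it.
key = tag.ffill()
print(key.tolist())        # ['AA', 'AA', 'AA', 'AB', 'AB', 'AB']

# Alternative key: start a new record number at every non-NaN tag.
record_id = tag.notna().cumsum()
print(record_id.tolist())  # [1, 1, 1, 2, 2, 2]
```

Either Series can be passed to `groupby` because it shares the DataFrame's index.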

agg_func is equivalent to writing:

{'Tag': lambda s: '\n'.join(s).strip(),
 'Item': lambda s: '\n'.join(s).strip(),
 'Length': lambda s: '\n'.join(s).strip(),
 'Description': lambda s: '\n'.join(s).strip(),
 'Value': lambda s: '\n'.join(s).strip()}
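Since every column gets the same function, a single callable can be passed to `agg` instead of a dict. The following is a self-contained variant under a few assumptions of mine, not the answer's exact code. `join_lines` is a hypothetical helper that drops NaN cells rather than filling them with `''`, so blank lines inside a cell are removed, and it casts values to `str` so that non-string cells do not break `'\n'.join` (a float cell would then render as text such as `2.0`). Tag is dropped from the frame and restored from the group key.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Tag": ["AA", np.nan, np.nan, "AB", np.nan, np.nan],
    "Item": ["Some", "Text", np.nan, "More", "Text", np.nan],
    "Length": ["2", np.nan, np.nan, "4", np.nan, np.nan],
    "Description": ["Very Very", "Very long", "Value",
                    "Other Very", "Very long", "Value"],
    "Value": [np.nan, np.nan, np.nan, "aaaa", "bbbb", "cccc"],
})

def join_lines(s):
    """Join the non-NaN cells of one column in one group, one per line."""
    return "\n".join(s.dropna().astype(str))

key = df["Tag"].ffill()  # continuation rows inherit the tag above them

out = (
    df.drop(columns="Tag")     # Tag comes back from the group key
      .groupby(key, sort=False)  # sort=False keeps the order of the PDF
      .agg(join_lines)
      .reset_index()
)
print(out)
```

On the sample data this gives one row per tag, with an empty string where a record has no Value at all.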