How to remove data frame header and make it a row

Time:07-25

I am using tabula to read tables from PDF files.

tables = tabula.read_pdf(file, pages="all")

This works fine. Now tables is a list of dataframes, where each dataframe is a table from the PDF file.

The rows are indexed 0, 1, 2, 3, etc., but the first row of each table is taken as the column names (header) of its dataframe instead of being kept as data.

Current dataframe:

  Component manufacturer               DMNS
0         Component name               KL32/OOH8
1         Component type               LTE-M/NB-IoT
2       Package markings               <pin 1 marker>\ ksdc 99cdjh
3              Date code               Not discerned
4           Package type               127-pin land grid array (LGA)
5           Package size               26.00 mm × 10.11 mm × 3.05 mm

Desired Dataframe:

        0                                1
0       Component manufacturer           DMNS
1       Component name                   KL32/OOH8
2       Component type                   LTE-M/NB-IoT
3       Package markings                 <pin 1 marker>\ ksdc 99cdjh
4       Date code                        Not discerned
5       Package type                     127-pin land grid array (LGA)
6       Package size                     26.00 mm × 10.11 mm × 3.05 mm

How can I do this transformation?

CodePudding user response:

The tabula docs for read_pdf describe a pandas_options argument, and the example they give, {'header': None}, is the one you need. So something like this should do the trick:

tabula.read_pdf(file, pages="all", pandas_options={'header': None})

Edit: Apparently this only works if you set multiple_tables to False, which is not the default. I'd experiment with the options a bit, and if that doesn't give the desired result, there is a post on how to turn the column names into the first row.
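If the read_pdf options don't behave as hoped, the fallback of turning the column names into the first row can be done in plain pandas after the tables are read. The following is only a rough sketch of one way to do it (using pd.concat, not necessarily what the linked post does). The helper name header_to_first_row and the small sample frame are made up for illustration; the sample stands in for one of the dataframes in tables.

```python
import pandas as pd


def header_to_first_row(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a new frame whose row 0 holds df's old column labels."""
    # One-row frame containing the old column labels as ordinary values.
    header_row = pd.DataFrame([list(df.columns)])
    # Copy of the data with positional column labels (0, 1, ...) so it aligns with header_row.
    body = df.copy()
    body.columns = range(df.shape[1])
    # Stack the header row on top of the data and renumber the rows from 0.
    return pd.concat([header_row, body], ignore_index=True)


# Stand-in for one table from tabula.read_pdf (values taken from the question).
sample = pd.DataFrame({
    "Component manufacturer": ["Component name", "Component type"],
    "DMNS": ["KL32/OOH8", "LTE-M/NB-IoT"],
})

result = header_to_first_row(sample)
print(result)
```

To apply it to everything tabula returned, something like tables = [header_to_first_row(t) for t in tables] should work, assuming every entry in tables is a dataframe.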

CodePudding user response:

Here's a way to do what your question asks:

df = df.T.reset_index().T.reset_index(drop=True)

Output:

                        0                              1
0  Component manufacturer                           DMNS
1          Component name                      KL32/OOH8
2          Component type                   LTE-M/NB-IoT
3        Package markings    <pin 1 marker>\ ksdc 99cdjh
4               Date code                  Not discerned
5            Package type  127-pin land grid array (LGA)
6            Package size  26.00 mm × 10.11 mm × 3.05 mm

Explanation:

  • Transpose the dataframe so we can use reset_index() to convert the index (i.e., the original column labels) into a new initial column.
  • Transpose it again so the new initial column becomes an initial row, and use reset_index() to get a fresh integer index.
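To see what each step of the one-liner does, here it is split into intermediate results. This is just an illustrative sketch: the two-row sample frame and the names step1, step2, step3, final and fixed_tables are made up here, and the list at the end stands in for what tabula.read_pdf would return.

```python
import pandas as pd

# Small stand-in for one table from the PDF (values taken from the question).
df = pd.DataFrame({
    "Component manufacturer": ["Component name", "Component type"],
    "DMNS": ["KL32/OOH8", "LTE-M/NB-IoT"],
})

# Step 1: transpose, so the old column labels become the row index.
step1 = df.T

# Step 2: reset_index() moves that index into a new first column named "index".
step2 = step1.reset_index()

# Step 3: transpose back, so the new first column becomes the first row.
# The row labels are now "index", 0, 1.
step3 = step2.T

# Step 4: drop those row labels and replace them with a fresh 0..n-1 index.
final = step3.reset_index(drop=True)
print(final)

# The same chain applied to every table in a list (stand-in for tabula's output).
tables = [df, df.copy()]
fixed_tables = [t.T.reset_index().T.reset_index(drop=True) for t in tables]
```

One thing to keep in mind: because the old header strings now sit in the same columns as the data, columns that held numbers will end up with object dtype after this transformation.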