I am using tabula to read tables from PDF files.
tables = tabula.read_pdf(file, pages="all")
This works fine. Now tables is a list of dataframes, where each dataframe is a table from the PDF file.
However, the table rows are indexed 0, 1, 2, 3, and so on, while the first row of each table is taken as the column names (header) of its dataframe.
Current dataframe:
Component manufacturer DMNS
0 Component name KL32/OOH8
1 Component type LTE-M/NB-IoT
2 Package markings <pin 1 marker>\ ksdc 99cdjh
3 Date code Not discerned
4 Package type 127-pin land grid array (LGA)
5 Package size 26.00 mm × 10.11 mm × 3.05 mm
Desired Dataframe:
0 1
0 Component manufacturer DMNS
1 Component name KL32/OOH8
2 Component type LTE-M/NB-IoT
3 Package markings <pin 1 marker>\ ksdc 99cdjh
4 Date code Not discerned
5 Package type 127-pin land grid array (LGA)
6 Package size 26.00 mm × 10.11 mm × 3.05 mm
How can I do this transformation?
CodePudding user response:
As the tabula docs on read_pdf state, you can pass pandas_options, and they even give the one you need as an example: {'header': None}. So something like this should do the trick:
tabula.read_pdf(file, pages="all", pandas_options={'header': None})
Edit: Apparently that only works if you set multiple_tables to False, which is not the default. I'd play with the options a bit, and if that doesn't give the desired result, here is a post on how to turn the column names into the first row.
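If the read_pdf options don't help, one way to turn the column names into the first row is sketched below. This is only an illustration: header_to_first_row is a hypothetical helper name, and the small dataframe stands in for one table returned by tabula.

```python
import pandas as pd

def header_to_first_row(df):
    """Return a copy of df whose column labels have become row 0.

    Hypothetical helper for illustration; not part of tabula or pandas.
    """
    n_cols = df.shape[1]
    # One-row frame holding the old column labels, with columns 0..n-1.
    header = pd.DataFrame([df.columns.tolist()], columns=range(n_cols))
    # Same data, but with the columns relabelled 0..n-1 so they line up.
    body = df.copy()
    body.columns = range(n_cols)
    # Stack header on top of body and renumber the rows from 0.
    return pd.concat([header, body], ignore_index=True)

# Stand-in for one of the dataframes in `tables` (assumed sample data).
sample = pd.DataFrame(
    {
        "Component manufacturer": ["Component name", "Component type"],
        "DMNS": ["KL32/OOH8", "LTE-M/NB-IoT"],
    }
)
result = header_to_first_row(sample)
print(result)
```

To apply it to every table, something like tables = [header_to_first_row(t) for t in tables] should work.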
CodePudding user response:
Here's a way to do what your question asks:
df = df.T.reset_index().T.reset_index(drop=True)
Output:
0 1
0 Component manufacturer DMNS
1 Component name KL32/OOH8
2 Component type LTE-M/NB-IoT
3 Package markings <pin 1 marker>\ ksdc 99cdjh
4 Date code Not discerned
5 Package type 127-pin land grid array (LGA)
6 Package size 26.00 mm × 10.11 mm × 3.05 mm
Explanation:
- Transpose the dataframe so we can use reset_index() to convert the index (i.e., the original column labels) to a new initial column.
- Transpose it again so the new initial column becomes an initial row, and use reset_index(drop=True) to get a fresh integer index.
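Since tabula returns a list of dataframes, the same one-liner can be applied to each table with a list comprehension. The following is a minimal sketch; the tables list here is built by hand as an assumed stand-in for the result of tabula.read_pdf, so it runs without a PDF.

```python
import pandas as pd

# Assumed stand-in for `tabula.read_pdf(file, pages="all")`:
# a list with one dataframe whose header holds the first data row.
tables = [
    pd.DataFrame(
        {
            "Component manufacturer": ["Component name", "Component type"],
            "DMNS": ["KL32/OOH8", "LTE-M/NB-IoT"],
        }
    )
]

# Transpose, move the old column labels into a column, transpose back,
# then renumber the rows from 0.
fixed = [t.T.reset_index().T.reset_index(drop=True) for t in tables]

print(fixed[0])
```

Note that transposing a dataframe with mixed column types typically leaves every column with the object dtype, so numeric columns may need converting afterwards.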