How to extract all tables from a PDF?

Time:11-24

Is there a way to extract the data from every table in a PDF using Python?

I've tested tabula, camelot and pdfplumber, but none of them extracts every table, or extracts them correctly.

An example:

[image: example table from the PDF]

I would like to work with the extracted data as matrices, DataFrames, etc.

Should I use OCR for better recognition?

EDIT:

I am trying to retrieve this table from a PDF using tabula-py.

My script:

import tabula

tables = tabula.read_pdf(filename, pages="3", output_format="dataframe", multiple_tables=True)
print(tables)

The output:

[                                amortization (EBITDA)     205     306     263     284      255
0                             Operating profit (EBIT)     125     243     207     221      191 
1                                      Net financials     (3)     (7)     (8)     (5)     (13) 
2                      Profit for the year before tax     122     247     201     216      178 
3        Profit for the year of continuing operations      92     192     154     160      138 
4   Profit/loss for the year of discontinued opera...       -       3    (14)       5    (134) 
5                                 Profit for the year      92     195     140     165        4 
6                                                 NaN     NaN     NaN     NaN     NaN      NaN 
7                     STATEMENT OF FINANCIAL POSITION     NaN     NaN     NaN     NaN      NaN 
8                                        Total assets   1,393   1,444   1,852   1,854    2,022 
9         Average invested capital including goodwill     772     736     659     708      914 
10                                Net working capital     318     314     268     314      279 
11                                       Total equity     723     740     884     833      809 
12                           Non-controlling interest      10       7       5       4        4 
13             Net interest-bearing debt, end of year      17      25      82      52      118 
14                                                NaN     NaN     NaN     NaN     NaN      NaN 
15                            STATEMENT OF CASH FLOWS     NaN     NaN     NaN     NaN      NaN 
16                Cash flow from operating activities     175     183     226     264      232 
17                Cash flow from investing activities    (88)      55      15    (91)    (167) 
18       Investments in property, plant and equipment    (72)    (81)    (45)    (77)     (58) 
19                                     Free cash flow      87     238     241     173       65 
20                Cash flow from financing activities    (79)   (319)   (172)   (109)     (35) 
21                         Net cash flow for the year       8    (81)      69      64       30 
22                                                NaN     NaN     NaN     NaN     NaN      NaN 
23                                     KEY RATIOS (%)     NaN     NaN     NaN     NaN      NaN 
24                                     Revenue growth     3.2     1.0     2.9     5.7      5.5 
25                                       Gross margin    55.3    56.8    54.8    57.3     56.6 
26                                         Cost ratio    50.7    47.7    47.0    49.3     48.7 
27                                      EBITDA margin     7.5    11.5    10.0    11.0     10.5 
28                                        EBIT margin     4.5     9.1     7.8     8.6      7.9 
29                                           Tax rate    24.0    22.2    23.2    25.8     22.5 
30                                   Return on equity    12.2    23.5    18.0    19.5     16.9 
31                                       Equity ratio    51.9    51.2    47.5    45.3     40.0 
32  Return on invested capital, 12 months trailing...    16.2    33.0    31.4    31.2     20.9 
33               Net working capital in proportion to     NaN     NaN     NaN     NaN      NaN 
34                         12 months trailing revenue    11.6    11.8    10.2    12.3     11.5 
35                                    Cash conversion     0.7     1.0     1.2     0.8      0.3 
36                                  Financial gearing     2.4     3.4     9.3     6.3     14.6 
37                                   INCOME STATEMENT     NaN     NaN     NaN     NaN      NaN 
38                                            Revenue   2,749   2,665   2,638   2,563    2,424 
39                                       Gross profit   1,519   1,513   1,446   1,470    1,371 
40                                                NaN     NaN     NaN     NaN     NaN      NaN 
41                                 SHARE-BASED RATIOS     NaN     NaN     NaN     NaN      NaN 
42                 Average number of shares excluding     NaN     NaN     NaN     NaN      NaN 
43               treasury shares, diluted (thousands)  16,639  16,678  16,550  16,447   16,402 
44                      Share price, end of year, DKK   140.0   172.0   187.5   185.5    122.0 
45                            Earnings per share, DKK     5.3    11.6     8.5     9.9      0.1 
46                    Diluted earnings per share, DKK     5.3    11.6     8.5     9.9      0.1 
47                   Diluted cash flow per share, DKK    10.5    11.0    13.7    18.2     14.2 
48             Diluted net asset value per share, DKK    42.9    44.0    53.1    50.3     49.1 
49                        Diluted price/earnings, DKK    26.4    14.8    22.1    18.7  1,220.0 
50                                                NaN     NaN     NaN     NaN     NaN      NaN 
51                                          EMPLOYEES     NaN     NaN     NaN     NaN      NaN 
52  Number of employees, calculated as FTEs, end o...   1,186   1,146   1,042   1,047    1,264 
53                      NUMBER OF STORES (OWN STORES)     NaN     NaN     NaN     NaN      NaN 
54                                      Retail stores     126     115      95     107      102
55                                        Concessions      43      42      42      41       42]

It skips the first rows of the table. What am I doing wrong?

Here is the link to download the PDF; the table to test is on page 3.

CodePudding user response:

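In my opinion, Camelot gives a good result with the stream flavor, which is meant for tables whose cells are separated by whitespace rather than by ruling lines.

import camelot
tables = camelot.read_pdf("YOUR-PDF-PATH", pages="3", flavor="stream")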

print(tables[0].df) gives:

0                                         DKK million  2016/17  2015/16  2014/15  2013/14  2012/131)
1                                    INCOME STATEMENT                                               
2                                             Revenue    2,749    2,665    2,638    2,563      2,424
3                                        Gross profit    1,519    1,513    1,446    1,470      1,371
4            Operating profit before depreciation and                                               
5                               amortization (EBITDA)      205      306      263      284        255
6                             Operating profit (EBIT)      125      243      207      221        191
7                                      Net financials      (3)      (7)      (8)      (5)       (13)
8                      Profit for the year before tax      122      247      201      216        178
9        Profit for the year of continuing operations       92      192      154      160        138
10  Profit/loss for the year of discontinued opera...        -        3     (14)        5      (134)
11                                Profit for the year       92      195      140      165          4
12                    STATEMENT OF FINANCIAL POSITION                                               
13                                       Total assets    1,393    1,444    1,852    1,854      2,022
14        Average invested capital including goodwill      772      736      659      708        914
15                                Net working capital      318      314      268      314        279
16                                       Total equity      723      740      884      833        809
17                           Non-controlling interest       10        7        5        4          4
18             Net interest-bearing debt, end of year       17       25       82       52        118
19                            STATEMENT OF CASH FLOWS                                               
20                Cash flow from operating activities      175      183      226      264        232
21                Cash flow from investing activities     (88)       55       15     (91)      (167)
22       Investments in property, plant and equipment     (72)     (81)     (45)     (77)       (58)
23                                     Free cash flow       87      238      241      173         65
24                Cash flow from financing activities     (79)    (319)    (172)    (109)       (35)
25                         Net cash flow for the year        8     (81)       69       64         30
26                                     KEY RATIOS (%)                                               
27                                     Revenue growth      3.2      1.0      2.9      5.7        5.5
28                                       Gross margin     55.3     56.8     54.8     57.3       56.6
29                                         Cost ratio     50.7     47.7     47.0     49.3       48.7
30                                      EBITDA margin      7.5     11.5     10.0     11.0       10.5
31                                        EBIT margin      4.5      9.1      7.8      8.6        7.9
32                                           Tax rate     24.0     22.2     23.2     25.8       22.5
33                                   Return on equity     12.2     23.5     18.0     19.5       16.9
34                                       Equity ratio     51.9     51.2     47.5     45.3       40.0
35  Return on invested capital, 12 months trailing...     16.2     33.0     31.4     31.2       20.9
36               Net working capital in proportion to                                               
37                         12 months trailing revenue     11.6     11.8     10.2     12.3       11.5
38                                    Cash conversion      0.7      1.0      1.2      0.8        0.3
39                                  Financial gearing      2.4      3.4      9.3      6.3       14.6
40                                 SHARE-BASED RATIOS                                               
41                 Average number of shares excluding                                               
42               treasury shares, diluted (thousands)   16,639   16,678   16,550   16,447     16,402
43                      Share price, end of year, DKK    140.0    172.0    187.5    185.5      122.0
44                            Earnings per share, DKK      5.3     11.6      8.5      9.9        0.1
45                    Diluted earnings per share, DKK      5.3     11.6      8.5      9.9        0.1
46                   Diluted cash flow per share, DKK     10.5     11.0     13.7     18.2       14.2
47             Diluted net asset value per share, DKK     42.9     44.0     53.1     50.3       49.1
48                        Diluted price/earnings, DKK     26.4     14.8     22.1     18.7    1,220.0
49                                          EMPLOYEES                                               
50  Number of employees, calculated as FTEs, end o...    1,186    1,146    1,042    1,047      1,264
51                      NUMBER OF STORES (OWN STORES)                                               
52                                      Retail stores      126      115       95      107        102
53                                        Concessions       43       42       42       41         42
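Since the goal is to work with the values as a DataFrame, note that every cell in the frame above is a string. The following is only a sketch of one possible clean-up, not part of Camelot: `to_number` and `clean_table` are hypothetical helper names, the sample rows are hand-copied from the output above, and treating "-" as missing (rather than zero) is an assumption you may want to change.

```python
import pandas as pd


def to_number(cell):
    """Convert a cell such as '1,220.0', '(134)' or '-' to a float, else NaN."""
    text = str(cell).strip()
    if text in ("", "-"):
        # Assumption: an empty cell or a lone dash means "no value".
        return float("nan")
    # Accounting notation: parentheses mark a negative number.
    negative = text.startswith("(") and text.endswith(")")
    text = text.strip("()").replace(",", "")
    try:
        value = float(text)
    except ValueError:
        return float("nan")
    return -value if negative else value


def clean_table(raw):
    """Promote the first row to the header and make the value columns numeric."""
    header = raw.iloc[0].tolist()
    body = raw.iloc[1:].reset_index(drop=True)
    body.columns = header
    for col in header[1:]:
        body[col] = body[col].map(to_number)
    # Use the label column (first column) as the index.
    return body.set_index(header[0])


# Hand-copied sample in the shape of tables[0].df (all cells are strings).
raw = pd.DataFrame([
    ["DKK million", "2016/17", "2015/16"],
    ["INCOME STATEMENT", "", ""],
    ["Revenue", "2,749", "2,665"],
    ["Net financials", "(3)", "(7)"],
    ["Profit/loss for the year of discontinued opera...", "-", "3"],
])
clean = clean_table(raw)
print(clean)
```

Section headings such as INCOME STATEMENT end up as rows containing only NaN, so `clean.dropna(how="all")` removes them if they are not needed.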

For more information about Camelot, you can read the official documentation. In particular, the API reference may be useful to you.
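To cover the original question (every table in the PDF, not only page 3), the same call can be made with `pages="all"`. This is a minimal sketch, assuming the stream flavor suits all pages of your document, which may not hold. `extract_all_tables` is a hypothetical helper name, and its `reader` parameter exists only so the function can be tried without Camelot installed. `parsing_report` is the per-table dictionary Camelot provides, which includes the page number.

```python
def extract_all_tables(pdf_path, flavor="stream", reader=None):
    """Return a list of (page, DataFrame) pairs for every table Camelot detects."""
    if reader is None:
        # Imported lazily so the helper can be exercised with a stand-in reader.
        import camelot
        reader = camelot.read_pdf
    tables = reader(pdf_path, pages="all", flavor=flavor)
    return [(table.parsing_report["page"], table.df) for table in tables]


# Example usage (requires Camelot and a real PDF):
# for page, df in extract_all_tables("YOUR-PDF-PATH"):
#     print(page, df.shape)
```

Pages that mix ruled and unruled tables may need a second pass with `flavor="lattice"`; checking the `accuracy` entry of each `parsing_report` is one way to spot tables that were parsed poorly.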
