Is there a way to extract data from every table in a PDF using Python?
I've tested tabula, camelot, and pdfplumber, but none of them extracts everything, or extracts it correctly.
An example:
I would like to work with these tables as matrices, DataFrames, etc.
Should I use OCR for better recognition?
EDIT:
I am trying to retrieve this table from a PDF using tabula-py.
My script:
import tabula

# filename is the path to the PDF; read_pdf returns a list of DataFrames
tables = tabula.read_pdf(filename, pages="3", output_format="dataframe", multiple_tables=True)
print(tables)
The output:
[ amortization (EBITDA) 205 306 263 284 255
0 Operating profit (EBIT) 125 243 207 221 191
1 Net financials (3) (7) (8) (5) (13)
2 Profit for the year before tax 122 247 201 216 178
3 Profit for the year of continuing operations 92 192 154 160 138
4 Profit/loss for the year of discontinued opera... - 3 (14) 5 (134)
5 Profit for the year 92 195 140 165 4
6 NaN NaN NaN NaN NaN NaN
7 STATEMENT OF FINANCIAL POSITION NaN NaN NaN NaN NaN
8 Total assets 1,393 1,444 1,852 1,854 2,022
9 Average invested capital including goodwill 772 736 659 708 914
10 Net working capital 318 314 268 314 279
11 Total equity 723 740 884 833 809
12 Non-controlling interest 10 7 5 4 4
13 Net interest-bearing debt, end of year 17 25 82 52 118
14 NaN NaN NaN NaN NaN NaN
15 STATEMENT OF CASH FLOWS NaN NaN NaN NaN NaN
16 Cash flow from operating activities 175 183 226 264 232
17 Cash flow from investing activities (88) 55 15 (91) (167)
18 Investments in property, plant and equipment (72) (81) (45) (77) (58)
19 Free cash flow 87 238 241 173 65
20 Cash flow from financing activities (79) (319) (172) (109) (35)
21 Net cash flow for the year 8 (81) 69 64 30
22 NaN NaN NaN NaN NaN NaN
23 KEY RATIOS (%) NaN NaN NaN NaN NaN
24 Revenue growth 3.2 1.0 2.9 5.7 5.5
25 Gross margin 55.3 56.8 54.8 57.3 56.6
26 Cost ratio 50.7 47.7 47.0 49.3 48.7
27 EBITDA margin 7.5 11.5 10.0 11.0 10.5
28 EBIT margin 4.5 9.1 7.8 8.6 7.9
29 Tax rate 24.0 22.2 23.2 25.8 22.5
30 Return on equity 12.2 23.5 18.0 19.5 16.9
31 Equity ratio 51.9 51.2 47.5 45.3 40.0
32 Return on invested capital, 12 months trailing... 16.2 33.0 31.4 31.2 20.9
33 Net working capital in proportion to NaN NaN NaN NaN NaN
34 12 months trailing revenue 11.6 11.8 10.2 12.3 11.5
35 Cash conversion 0.7 1.0 1.2 0.8 0.3
36 Financial gearing 2.4 3.4 9.3 6.3 14.6
37 INCOME STATEMENT NaN NaN NaN NaN NaN
38 Revenue 2,749 2,665 2,638 2,563 2,424
39 Gross profit 1,519 1,513 1,446 1,470 1,371
40 NaN NaN NaN NaN NaN NaN
41 SHARE-BASED RATIOS NaN NaN NaN NaN NaN
42 Average number of shares excluding NaN NaN NaN NaN NaN
43 treasury shares, diluted (thousands) 16,639 16,678 16,550 16,447 16,402
44 Share price, end of year, DKK 140.0 172.0 187.5 185.5 122.0
45 Earnings per share, DKK 5.3 11.6 8.5 9.9 0.1
46 Diluted earnings per share, DKK 5.3 11.6 8.5 9.9 0.1
47 Diluted cash flow per share, DKK 10.5 11.0 13.7 18.2 14.2
48 Diluted net asset value per share, DKK 42.9 44.0 53.1 50.3 49.1
49 Diluted price/earnings, DKK 26.4 14.8 22.1 18.7 1,220.0
50 NaN NaN NaN NaN NaN NaN
51 EMPLOYEES NaN NaN NaN NaN NaN
52 Number of employees, calculated as FTEs, end o... 1,186 1,146 1,042 1,047 1,264
53 NUMBER OF STORES (OWN STORES) NaN NaN NaN NaN NaN
54 Retail stores 126 115 95 107 102
55 Concessions 43 42 42 41 42]
It ignores the first lines. What am I doing wrong?
Here is the link to download the PDF, to test on page 3.
CodePudding user response:
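In my opinion, Camelot gives a good result with the stream flavor.
import camelot

# Replace the placeholder string with the path to your PDF
tables = camelot.read_pdf("YOUR-PDF-PATH", pages='3', flavor='stream')
# Each detected table exposes its contents as a pandas DataFrame via .df
print(tables[0].df)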
gives:
0 DKK million 2016/17 2015/16 2014/15 2013/14 2012/131)
1 INCOME STATEMENT
2 Revenue 2,749 2,665 2,638 2,563 2,424
3 Gross profit 1,519 1,513 1,446 1,470 1,371
4 Operating profit before depreciation and
5 amortization (EBITDA) 205 306 263 284 255
6 Operating profit (EBIT) 125 243 207 221 191
7 Net financials (3) (7) (8) (5) (13)
8 Profit for the year before tax 122 247 201 216 178
9 Profit for the year of continuing operations 92 192 154 160 138
10 Profit/loss for the year of discontinued opera... - 3 (14) 5 (134)
11 Profit for the year 92 195 140 165 4
12 STATEMENT OF FINANCIAL POSITION
13 Total assets 1,393 1,444 1,852 1,854 2,022
14 Average invested capital including goodwill 772 736 659 708 914
15 Net working capital 318 314 268 314 279
16 Total equity 723 740 884 833 809
17 Non-controlling interest 10 7 5 4 4
18 Net interest-bearing debt, end of year 17 25 82 52 118
19 STATEMENT OF CASH FLOWS
20 Cash flow from operating activities 175 183 226 264 232
21 Cash flow from investing activities (88) 55 15 (91) (167)
22 Investments in property, plant and equipment (72) (81) (45) (77) (58)
23 Free cash flow 87 238 241 173 65
24 Cash flow from financing activities (79) (319) (172) (109) (35)
25 Net cash flow for the year 8 (81) 69 64 30
26 KEY RATIOS (%)
27 Revenue growth 3.2 1.0 2.9 5.7 5.5
28 Gross margin 55.3 56.8 54.8 57.3 56.6
29 Cost ratio 50.7 47.7 47.0 49.3 48.7
30 EBITDA margin 7.5 11.5 10.0 11.0 10.5
31 EBIT margin 4.5 9.1 7.8 8.6 7.9
32 Tax rate 24.0 22.2 23.2 25.8 22.5
33 Return on equity 12.2 23.5 18.0 19.5 16.9
34 Equity ratio 51.9 51.2 47.5 45.3 40.0
35 Return on invested capital, 12 months trailing... 16.2 33.0 31.4 31.2 20.9
36 Net working capital in proportion to
37 12 months trailing revenue 11.6 11.8 10.2 12.3 11.5
38 Cash conversion 0.7 1.0 1.2 0.8 0.3
39 Financial gearing 2.4 3.4 9.3 6.3 14.6
40 SHARE-BASED RATIOS
41 Average number of shares excluding
42 treasury shares, diluted (thousands) 16,639 16,678 16,550 16,447 16,402
43 Share price, end of year, DKK 140.0 172.0 187.5 185.5 122.0
44 Earnings per share, DKK 5.3 11.6 8.5 9.9 0.1
45 Diluted earnings per share, DKK 5.3 11.6 8.5 9.9 0.1
46 Diluted cash flow per share, DKK 10.5 11.0 13.7 18.2 14.2
47 Diluted net asset value per share, DKK 42.9 44.0 53.1 50.3 49.1
48 Diluted price/earnings, DKK 26.4 14.8 22.1 18.7 1,220.0
49 EMPLOYEES
50 Number of employees, calculated as FTEs, end o... 1,186 1,146 1,042 1,047 1,264
51 NUMBER OF STORES (OWN STORES)
52 Retail stores 126 115 95 107 102
53 Concessions 43 42 42 41 42
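In the output above, a few labels are wrapped over two rows (rows 4-5, 36-37 and 41-42). A minimal sketch of one way to join them, working on plain lists of strings: the function name `merge_wrapped_labels` is hypothetical, and it assumes that section headings are the only label-only rows written in capitals, which holds for this table but may not hold for others.

```python
def merge_wrapped_labels(rows):
    """Join a label-only row with the row that follows it.

    Assumption: a row with no values whose label is NOT all upper-case
    is the first half of a wrapped label; upper-case label-only rows
    are section headings and are kept as they are.
    """
    merged = []
    pending = None  # first half of a wrapped label, waiting for the next row
    for row in rows:
        label, values = row[0].strip(), list(row[1:])
        if pending is not None:
            # Prepend the stored first half to the current label
            label = f"{pending} {label}"
            pending = None
        is_label_only = not any(cell.strip() for cell in values)
        if is_label_only and not label.isupper():
            pending = label
            continue
        merged.append([label, *values])
    if pending is not None:
        # A trailing label with no continuation is kept, padded with blanks
        merged.append([pending, *[""] * (len(rows[-1]) - 1)])
    return merged


# Rows copied from the Camelot output above (empty cells as empty strings)
sample = [
    ["INCOME STATEMENT", "", "", "", "", ""],
    ["Revenue", "2,749", "2,665", "2,638", "2,563", "2,424"],
    ["Operating profit before depreciation and", "", "", "", "", ""],
    ["amortization (EBITDA)", "205", "306", "263", "284", "255"],
    ["Operating profit (EBIT)", "125", "243", "207", "221", "191"],
]
cleaned = merge_wrapped_labels(sample)
for row in cleaned:
    print(row)
```

With Camelot, the input rows could presumably be obtained with `tables[0].df.values.tolist()`, assuming every cell is a string as in the output above.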
For more information about Camelot, you can read the official documentation. The API reference in particular may be useful to you.
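Since you want to work with the values as a matrix or DataFrame, note that the extracted cells are strings. Here is a rough sketch of a converter, not part of Camelot: the helper name `parse_amount` is hypothetical, and it assumes that a comma is a thousands separator and that parentheses mark negative amounts, as is usual in financial statements.

```python
def parse_amount(text):
    """Convert a cell such as '1,393', '(134)' or '55.3' to a float.

    Returns None for blank cells, for the '-' placeholder and for
    anything that is not a number (for example a column header).
    """
    cleaned = text.strip().replace(",", "")  # assumed thousands separator
    if cleaned in ("", "-"):
        return None
    # Assumed accounting convention: (134) means -134
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return -value if negative else value


# Cells taken from the "Profit/loss for the year of discontinued operations" row
cells = ["-", "3", "(14)", "5", "(134)"]
print([parse_amount(cell) for cell in cells])
```

On a DataFrame, the same function could be applied column by column, for example `tables[0].df[1].map(parse_amount)`, assuming the integer column labels shown in the output above.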