php replace function for large data dump from pdf

Time:11-23

I'm looking to sift through a large block of text output and insert CSS styling into it. The columns are the same each time, but the number of numbered line items varies: the numbering can run anywhere from 1 to 50, so sometimes it's 1-13 and sometimes it's the full 1-50.

I have tried the replace method for each number, but I can't get it to handle every number in a way that makes the result easy to work with.

The code I have is:

$text = $page->getText();
$search = array('1.  ');
$replace = array('<br />1. ');
$result = str_replace($search, $replace, $text);
echo $result;

The output before formatting is:

DESCRIPTION QUANTITY UNIT PRICE TAX RCV DEPREC. ACV 1. Remove 3 tab - 25 yr. - composition shingle roofing - 21.25 SQ 52.83 0.00 1,122.64 (0.00) 1,122.64 incl. felt 2. 3 tab - 25 yr. - composition shingle roofing - incl. felt 24.67 SQ 186.84 199.43 4,808.77 (0.00) 4,808.77 3. R&R Drip edge 196.95 LF 2.33 16.27 475.16 (0.00) 475.16 4. Asphalt starter - universal starter course 100.58 LF 1.31 3.67 135.43 (0.00) 135.43 5. Ice & water barrier 118.94 SF 1.15 4.34 141.12 (0.00) 141.12 6. R&R Roof vent - turtle type - Metal 5.00 EA 51.09 6.55 262.00 (0.00) 262.00 7. R&R Flashing - pipe jack 3.00 EA 38.63 2.95 118.84 (0.00) 118.84 8. R&R Rain cap - 6" 2.00 EA 36.80 3.40 77.00 (0.00) 77.00 9. R&R Ridge cap - composition shingles 77.58 LF 4.67 7.04 369.33 (0.00) 369.33 10. Digital satellite system - Detach & reset 1.00 EA 31.40 0.00 31.40 (0.00) 31.40 11. Digital satellite system - alignment and calibration 1.00 EA 94.19 0.00 94.19 (0.00) 94.19 only 12. R&R Flashing - pipe jack - split boot 1.00 EA 76.00 3.76 79.76 (0.00) 79.76 13. R&R Gutter / downspout - aluminum - up to 5" 152.58 LF 7.13 49.56 1,137.45 (0.00) 1,137.45

I'm looking for it to translate into a table as follows, with one row for each line (1 to ~50):

<table>
<tr>
<th>DESCRIPTION</th>
<th>QUANTITY</th>
<th>UNIT PRICE</th>
<th>TAX</th>
<th>RCV</th>
<th>DEPREC.</th>
<th>ACV</th>
</tr>
<tr>
<td>1. Remove 3 tab - 25 yr. - composition shingle roofing - incl. felt</td>
<td>21.25 SQ </td>
<td>52.83</td>
<td>0.00</td>
<td>1,122.64</td>
<td>(0.00)</td>
<td>1,122.64</td>
</tr>
</table>

Is there a way to do this with JavaScript, or with a replace within a replace?

CodePudding user response:

If this data is a real-world sample, you should be able to split on one or more digits followed by a period, surrounded by whitespace on both sides. That will at least give you your individual lines.

<?php

$text = <<<TAG
DESCRIPTION QUANTITY UNIT PRICE TAX RCV DEPREC. ACV 1. Remove 3 tab - 25 yr. - composition shingle roofing - 21.25 SQ 52.83 0.00 1,122.64 (0.00) 1,122.64 incl. felt 2. 3 tab - 25 yr. - composition shingle roofing - incl. felt 24.67 SQ 186.84 199.43 4,808.77 (0.00) 4,808.77 3. R&R Drip edge 196.95 LF 2.33 16.27 475.16 (0.00) 475.16 4. Asphalt starter - universal starter course 100.58 LF 1.31 3.67 135.43 (0.00) 135.43 5. Ice & water barrier 118.94 SF 1.15 4.34 141.12 (0.00) 141.12 6. R&R Roof vent - turtle type - Metal 5.00 EA 51.09 6.55 262.00 (0.00) 262.00 7. R&R Flashing - pipe jack 3.00 EA 38.63 2.95 118.84 (0.00) 118.84 8. R&R Rain cap - 6" 2.00 EA 36.80 3.40 77.00 (0.00) 77.00 9. R&R Ridge cap - composition shingles 77.58 LF 4.67 7.04 369.33 (0.00) 369.33 10. Digital satellite system - Detach & reset 1.00 EA 31.40 0.00 31.40 (0.00) 31.40 11. Digital satellite system - alignment and calibration 1.00 EA 94.19 0.00 94.19 (0.00) 94.19 only 12. R&R Flashing - pipe jack - split boot 1.00 EA 76.00 3.76 79.76 (0.00) 79.76 13. R&R Gutter / downspout - aluminum - up to 5" 152.58 LF 7.13 49.56 1,137.45 (0.00) 1,137.45
TAG;

// Put a line break before each "N." item marker, then split on those breaks.
var_dump(explode(PHP_EOL, preg_replace('/\s+(\d+\.)\s+/', PHP_EOL . '$1 ', $text)));

This outputs:

array (
  0 => 'DESCRIPTION QUANTITY UNIT PRICE TAX RCV DEPREC. ACV',
  1 => '1. Remove 3 tab - 25 yr. - composition shingle roofing - 21.25 SQ 52.83 0.00 1,122.64 (0.00) 1,122.64 incl. felt',
  2 => '2. 3 tab - 25 yr. - composition shingle roofing - incl. felt 24.67 SQ 186.84 199.43 4,808.77 (0.00) 4,808.77',
  3 => '3. R&R Drip edge 196.95 LF 2.33 16.27 475.16 (0.00) 475.16',
  4 => '4. Asphalt starter - universal starter course 100.58 LF 1.31 3.67 135.43 (0.00) 135.43',
  5 => '5. Ice & water barrier 118.94 SF 1.15 4.34 141.12 (0.00) 141.12',
  6 => '6. R&R Roof vent - turtle type - Metal 5.00 EA 51.09 6.55 262.00 (0.00) 262.00',
  7 => '7. R&R Flashing - pipe jack 3.00 EA 38.63 2.95 118.84 (0.00) 118.84',
  8 => '8. R&R Rain cap - 6" 2.00 EA 36.80 3.40 77.00 (0.00) 77.00',
  9 => '9. R&R Ridge cap - composition shingles 77.58 LF 4.67 7.04 369.33 (0.00) 369.33',
  10 => '10. Digital satellite system - Detach & reset 1.00 EA 31.40 0.00 31.40 (0.00) 31.40',
  11 => '11. Digital satellite system - alignment and calibration 1.00 EA 94.19 0.00 94.19 (0.00) 94.19 only',
  12 => '12. R&R Flashing - pipe jack - split boot 1.00 EA 76.00 3.76 79.76 (0.00) 79.76',
  13 => '13. R&R Gutter / downspout - aluminum - up to 5" 152.58 LF 7.13 49.56 1,137.45 (0.00) 1,137.45',
)

Demo here: https://3v4l.org/n01E5

From there, you are going to need to perform some additional rules, and I think you might end up writing a parser unless your data exactly matches this every time.
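
As one possible shape for those additional rules, here is a minimal sketch that parses each numbered line into the seven columns from the question and builds the table. It assumes every row looks like the sample: an item number, a description, a quantity with a two-letter unit, then four amounts with two decimals (the depreciation in parentheses), optionally followed by a wrapped tail of the description such as "incl. felt". The function names splitRows, parseRow and renderTable are made up for this example, and rows that do not fit the pattern are skipped rather than guessed at.

```php
<?php

// Hypothetical helper: split the flat dump into one string per numbered item.
function splitRows(string $text): array
{
    // Split on whitespace that is followed by "N. " (digits, period, whitespace).
    $parts = preg_split('/\s+(?=\d+\.\s)/', trim($text));
    array_shift($parts); // assumption: the first chunk is the column header line
    return $parts;
}

// Hypothetical helper: turn one item line into seven cells, or null if it
// does not match the assumed layout.
function parseRow(string $line): ?array
{
    $num = '[\d,]+\.\d{2}'; // an amount such as 1,122.64
    $cols = [
        '(\d+\.\s+.*?)',               // item number and description
        '(' . $num . '\s+[A-Z]{2})',   // quantity with a two-letter unit
        '(' . $num . ')',              // unit price
        '(' . $num . ')',              // tax
        '(' . $num . ')',              // RCV
        '(\(' . $num . '\))',          // depreciation, in parentheses
        '(' . $num . ')',              // ACV
    ];
    // Optional trailing text is the wrapped end of the description.
    $pattern = '/^' . implode('\s+', $cols) . '(?:\s+(.*))?$/';

    if (!preg_match($pattern, trim($line), $m)) {
        return null;
    }

    $description = trim($m[1] . ' ' . ($m[8] ?? ''));
    return [$description, $m[2], $m[3], $m[4], $m[5], $m[6], $m[7]];
}

// Hypothetical helper: render the parsed rows as an HTML table.
function renderTable(array $rows): string
{
    // Hard-coded because "UNIT PRICE" contains a space and cannot be
    // recovered by splitting the header line on whitespace.
    $header = ['DESCRIPTION', 'QUANTITY', 'UNIT PRICE', 'TAX', 'RCV', 'DEPREC.', 'ACV'];

    $html = "<table>\n<tr>\n";
    foreach ($header as $cell) {
        $html .= '<th>' . htmlspecialchars($cell) . "</th>\n";
    }
    $html .= "</tr>\n";

    foreach ($rows as $row) {
        $html .= "<tr>\n";
        foreach ($row as $cell) {
            // Escape characters such as & and " that appear in the descriptions.
            $html .= '<td>' . htmlspecialchars($cell) . "</td>\n";
        }
        $html .= "</tr>\n";
    }

    return $html . "</table>\n";
}

// Shortened copy of the sample data from the question.
$text = 'DESCRIPTION QUANTITY UNIT PRICE TAX RCV DEPREC. ACV '
      . '1. Remove 3 tab - 25 yr. - composition shingle roofing - 21.25 SQ 52.83 0.00 1,122.64 (0.00) 1,122.64 incl. felt '
      . '2. 3 tab - 25 yr. - composition shingle roofing - incl. felt 24.67 SQ 186.84 199.43 4,808.77 (0.00) 4,808.77 '
      . '3. R&R Drip edge 196.95 LF 2.33 16.27 475.16 (0.00) 475.16';

// array_filter drops the rows that parseRow could not match (null).
$rows = array_filter(array_map('parseRow', splitRows($text)));
echo renderTable($rows);
```

Any trailing text after the last amount is folded back into the description, which is how "incl. felt" ends up in the first cell of row 1. If real dumps contain rows with a different shape, such as a missing unit or a negative amount, the pattern would need to be extended, which is where a proper parser starts to pay off.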

You might want to look at extracting text with formatting, or possibly trying to extract it as tabular data. Unfortunately, both of those examples are in C#.
