How to get matches from a text file using PHP with a specified start and end point?

Time:12-07

I have a PHP script that uses the preg_match_all function to return all the matches from a text file. However, I want it to check for a match only in a fixed field of each line: starting at position 3, with a length of 11 digits (that is, ending at position 13). Searching the entire line returns false positives.

Script:

<?php
$file = 'masterfile.out';
$searchfor = '02354098780';

// the following line prevents the browser from parsing this as HTML.
header('Content-Type: text/plain');

// get the file contents, assuming the file to be readable (and exist)
$contents = file_get_contents($file);
// escape special characters in the query
$pattern = preg_quote($searchfor, '/');
// finalise the regular expression, matching the whole line
$pattern = "/^.*$pattern.*\$/m";

// search, and store all matching occurrences in $matches
if(preg_match_all($pattern, $contents, $matches)){
   echo "Found matches:\n";
   echo substr(implode("\n", $matches[0]),2,11);
   echo substr(implode("\n", $matches[0]),166,11);
}
else{
   echo "No matches found";
}
?> 

Text file sample data:

I0023540987805R01  ABC                         GHI                       OLirrt                 000000000000000100EA 0812160070451700   1098833   1990041300000001086000000000108600000000000996000000000032100000000000000000000000000000000000000000000000000000000000000000000006589000000000000000                                     P0012B    
 0000002032902R01  DEF                         JKL                       KLijuI                 000000000000000100EA 0812160070451700   1029132   1997010800000002396000000000239600000120002326000000000000000000000000000000000000000000000000000000000000004560000000000000000000000000987600000000                                     A203SD   

CodePudding user response:

For a small number of characters, you can anchor the regular expression to the beginning of a line:

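'#^..([0-9]{11})#m'

will search for 11 digits, skipping the first two characters (the two dots) after the beginning (^) of each line, so that the match starts at the third character. The m modifier makes ^ match at the start of every line rather than only at the start of the file contents.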

In this case:

<?php
$file      = 'masterfile.out';
// $pattern   = '#^..([0-9]{11})#m'; // Any 11 digits
$pattern   = '#^..(02354098780)#m';   // Exactly these 11

// the following line prevents the browser from parsing this
// as HTML.
header('Content-Type: text/plain');

// get the file contents, assuming the file to be readable (and exists)
$contents = file_get_contents($file);

if (preg_match_all($pattern, $contents, $matches, PREG_PATTERN_ORDER)){
   echo "Found matches:\n";
   echo implode("\n", $matches[1]);
   echo "\n";
} else {
   echo "No matches found\n";
}

update

I've just noticed that your sequence starts on the third character when counting from 1. In some conventions (and in my earlier example) counting starts at 0. So if you count from 1, you need only two dots, not three. In other words, when you say "starting at position 3", you probably mean to skip the first two characters, while (as you can see from the other answers) almost everyone assumed you wanted to skip three characters.
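
To make the counting convention concrete, here is a minimal sketch. The helper name fieldAt is made up for illustration, and the line is a shortened copy of the first sample line. PHP's substr() uses zero-based offsets, so a field that starts at 1-based position 3 starts at offset 2.

```php
<?php
// Hypothetical helper: extract a fixed-width field given a 1-based start position.
function fieldAt(string $line, int $startPos, int $length): string
{
    // substr() counts from 0, so 1-based position 3 becomes offset 2.
    return substr($line, $startPos - 1, $length);
}

// Shortened copy of the first sample line.
$line = 'I0023540987805R01  ABC';

// Position 3, length 11: the field the question asks about.
echo fieldAt($line, 3, 11), "\n"; // 02354098780

// The same field via the regular expression: two dots skip two characters.
preg_match('#^..([0-9]{11})#', $line, $m);
echo $m[1], "\n"; // 02354098780
```

With three dots instead of two, the captured value for this line would be 23540987805, which is the field shifted by one character.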

CodePudding user response:

If your example is close to your intended use, you're essentially searching for an exact match of a substring, but using preg_match_all. However, iterating over the lines should have lower memory-impact, and strict substring-comparison for exact equality has lower cpu-impact than preg_match_all.

So I would recommend doing that. This can be achieved either with fgets or with stream_get_line, which might be slightly more performant (though that should not matter in most contexts).

This can be achieved as follows:

$searchString = 'someFixedString';
$posOffset = 2;
$matchLength = mb_strlen($searchString);
$filePath = '/some/file.path';
$fileHandle = @fopen($filePath, 'r'); // read-only is enough here
$checkedLines = 0;
$matches = [];
$foundMatches = false;

//Depending on what you wish to output
$capturePosOffset = 0;
$captureLength = $matchLength + $posOffset + 3;

// if lines are  no longer than 8192 bytes,
// otherwise set to a value above the byte-length of your lines
$maxBytesToReadPerLine = 0; 

// if file line-terminator is as in PHP, 
// otherwise set to file's line-terminator
$lineTerminator = PHP_EOL;

if ($fileHandle) {
    while (!feof($fileHandle)) {
       $checkedLines++;
       // or just use fgets, which requires no further arguments
       $line = stream_get_line($fileHandle, $maxBytesToReadPerLine, $lineTerminator);
       if (mb_substr($line, $posOffset, $matchLength) === $searchString) { 
           $foundMatches = true;
           // Capture the whole line...
           $matches[] = $line;
           // ...or, instead, capture a field with a fixed length
           // (modify the offset and length arguments above).
           // Enabling both would count every match twice.
           // $matches[] = mb_substr($line, $capturePosOffset, $captureLength);
       }
    }
}
if ($foundMatches) {
    echo "Found " . count($matches) . " matches among $checkedLines lines:" . PHP_EOL;
    foreach ($matches as $matchedValue) {
        // I'm not sure what you intend to do here.
        // - In your example code, it appears you
        // implode the array, but then only output
        // 11 characters of the first line starting at position 3.
        // If you want the whole line, you can capture it above
        // and echo it here.

        // Or if you want, you can capture and output the first field
        // by modifying $capturePosOffset and $captureLength
        // by merely echoing the value (and a newline)
        echo '  ' . $matchedValue . PHP_EOL;
    }
} else {
    echo "No matches found!" . PHP_EOL;
}

We use mb_strlen and mb_substr in case the encoding allows for multi-byte characters - only if you know that this is definitely not the case can strlen and substr be safely used.
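
As a small illustration of why this matters, the sketch below prefixes a shortened sample line with a made-up two-byte character. It assumes the mbstring extension is available and that the data is UTF-8; neither is stated in the question.

```php
<?php
// Made-up line: "\u{00E9}" is 'é', which takes two bytes in UTF-8.
$line = "\u{00E9}0023540987805R01";

// Byte-based: 'é' uses up both skipped bytes, so the field starts one character early.
echo substr($line, 2, 11), "\n";             // 00235409878

// Character-based: two characters are skipped, as intended.
echo mb_substr($line, 2, 11, 'UTF-8'), "\n"; // 02354098780
```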

One ought not to get bogged down in premature optimization, but just as a note: which solution performs best will depend heavily on the file size and the match length.
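
For comparison, the fgets variant mentioned earlier might look like the following sketch. The function name findLinesWithField is hypothetical, and the byte-based substr and strlen are used on the assumption that the file contains only single-byte characters, as the sample data does.

```php
<?php
// Hypothetical helper: return the lines whose fixed-width field equals $needle.
// Assumes single-byte data, so byte offsets equal character offsets.
function findLinesWithField(string $path, string $needle, int $offset): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return []; // file missing or unreadable
    }

    $found = [];
    // fgets() returns false at end of file, which ends the loop.
    while (($line = fgets($handle)) !== false) {
        // fgets() keeps the line terminator, so strip it.
        $line = rtrim($line, "\r\n");
        // Strict comparison of the field against the search string.
        if (substr($line, $offset, strlen($needle)) === $needle) {
            $found[] = $line;
        }
    }
    fclose($handle);

    return $found;
}
```

For the file in the question, the call would be findLinesWithField('masterfile.out', '02354098780', 2), where offset 2 corresponds to position 3 when counting from 1.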

CodePudding user response:

The following regular expression ignores the first 3 characters at the beginning of each line and captures the next 11:

https://regex101.com/r/MEaB67/1

/^.{3}(.{11})/gm

EDIT

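Here is some sample PHP code to test the regular expression. The g flag from regex101 is dropped because PHP has no such modifier; preg_match_all already returns every match.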

<pre>
<?php
$pattern = '/^.{3}(.{11})/m';
$subject = '
I0023540987805R01  ABC                         GHI                       OLirrt                 000000000000000100EA 0812160070451700   1098833   1990041300000001086000000000108600000000000996000000000032100000000000000000000000000000000000000000000000000000000000000000000006589000000000000000                                     P0012B    
 0000002032902R01  DEF                         JKL                       KLijuI                 000000000000000100EA 0812160070451700   1029132   1997010800000002396000000000239600000120002326000000000000000000000000000000000000000000000000000000000000004560000000000000000000000000987600000000                                     A203SD   
';
$matches = null;
preg_match_all($pattern, $subject, $matches);
var_dump($matches);
?>
</pre>

Fabio

CodePudding user response:

Here is a slightly different approach from yours. Since we are looking for a string in a specific part of each line, we can cut out just that part and check whether it equals the search string.

    <?php


$text = "I0023540987805R01  ABC                         GHI                       OLirrt                 000000000000000100EA 0812160070451700   1098833   1990041300000001086000000000108600000000000996000000000032100000000000000000000000000000000000000000000000000000000000000000000006589000000000000000                                     P0012B    
0000002032902R01  DEF                         JKL                       KLijuI                 000000000000000100EA 0812160070451700   1029132   1997010800000002396000000000239600000120002326000000000000000000000000000000000000000000000000000000000000004560000000000000000000000000987600000000                                     A203SD   ";

echo '<pre>';
$txt = explode("\n",$text);

echo '<pre>';
print_r($txt);

$searchfor = '02354098780';
$matches = [];     // search string, keyed by matching line index
$matchesLine = []; // whole line, keyed by matching line index

foreach($txt as $key => $line){
    // offset 2 is the third character; the field is 11 characters long
    $subbedString = substr($line,2,11);

    if($subbedString === $searchfor){
        $matches[$key] = $searchfor;
        $matchesLine[$key] = $line; /**Save the whole line when match is found. */
        echo "Found in line : $key";
    }
}

echo '<pre>';
print_r($matches);

echo '<pre>';
print_r($matchesLine);

Will return:

  Array
(
    [0] => I0023540987805R01  ABC                         GHI                       OLirrt                 000000000000000100EA 0812160070451700   1098833   1990041300000001086000000000108600000000000996000000000032100000000000000000000000000000000000000000000000000000000000000000000006589000000000000000                                     P0012B    
    [1] => 0000002032902R01  DEF                         JKL                       KLijuI                 000000000000000100EA 0812160070451700   1029132   1997010800000002396000000000239600000120002326000000000000000000000000000000000000000000000000000000000000004560000000000000000000000000987600000000                                     A203SD   
)
Found in line : 0
Array
(
    [0] => 02354098780
)
Array
(
    [0] => I0023540987805R01  ABC                         GHI                       OLirrt                 000000000000000100EA 0812160070451700   1098833   1990041300000001086000000000108600000000000996000000000032100000000000000000000000000000000000000000000000000000000000000000000006589000000000000000                                     P0012B    
)

CodePudding user response:

You can match 3 characters, then use \K to forget what is matched so far and then match 11 digits.

^...\K\d{11}
  • ^ Start of string (or of each line when the m flag is set)
  • ... Match 3 times any char except a newline
  • \K Clear the current match buffer
  • \d{11} Match 11 digits

You can omit using preg_quote as there is nothing to escape in the current pattern.

As the pattern uses an anchor ^ you have to specify the multiline flag /m to get all the results.

$file = 'masterfile.out';
$contents = file_get_contents($file);
$pattern = "/^...\K\d{11}/m";

if (preg_match_all($pattern, $contents, $matches)) {
    echo "Found matches:" . PHP_EOL;
    foreach ($matches[0] as $m) {
        echo $m . PHP_EOL;
    }
} else {
    echo "No matches found";
}

Output

Found matches:
23540987805
00002032902
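
To see what the m flag contributes, the following sketch runs the same pattern with and without it. The two-line string is a shortened copy of the sample data, and the variable names are chosen for illustration.

```php
<?php
// Two shortened lines modelled on the sample data.
$contents = "I0023540987805R01\n 0000002032902R01";

// Without /m, ^ matches only at the very start of the subject.
preg_match_all('/^...\K\d{11}/', $contents, $single);

// With /m, ^ also matches right after every newline.
preg_match_all('/^...\K\d{11}/m', $contents, $multi);

print_r($single[0]); // one match: 23540987805
print_r($multi[0]);  // two matches: 23540987805 and 00002032902
```

Because \K discards the three skipped characters, the digits appear in $matches[0] and no capture group is needed. If the field should instead start at the third character, as the update in the first answer suggests, two dots rather than three would be the pattern to try.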