I have a PHP script that uses the preg_match_all function to return all the matches from a text file. However, I want the function to check for a match only at a fixed location in each line: starting at position 3, with a length of 11 digits (so ending at position 13). Searching the entire line returns false positives.
Script:
<?php
$file = 'masterfile.out';
$searchfor = '02354098780';
// the following line prevents the browser from parsing this as HTML.
header('Content-Type: text/plain');
// get the file contents, assuming the file to be readable (and exist)
$contents = file_get_contents($file);
// escape special characters in the query
$pattern = preg_quote($searchfor, '/');
// finalise the regular expression, matching the whole line
$pattern = "/^.*$pattern.*\$/m";
// search, and store all matching occurrences in $matches
if(preg_match_all($pattern, $contents, $matches)){
echo "Found matches:\n";
echo substr(implode("\n", $matches[0]),2,11);
echo substr(implode("\n", $matches[0]),166,11);
}
else{
echo "No matches found";
}
?>
Text file sample data:
I0023540987805R01 ABC GHI OLirrt 000000000000000100EA 0812160070451700 1098833 1990041300000001086000000000108600000000000996000000000032100000000000000000000000000000000000000000000000000000000000000000000006589000000000000000 P0012B
0000002032902R01 DEF JKL KLijuI 000000000000000100EA 0812160070451700 1029132 1997010800000002396000000000239600000120002326000000000000000000000000000000000000000000000000000000000000004560000000000000000000000000987600000000 A203SD
CodePudding user response:
For a small number of characters, you can anchor the regular expression to the beginning of a line:
'#^..([0-9]{11})#'
will search for 11 digits, ignoring the first two characters (one . each) from the beginning (^) of the line, and starting the capture at the third.
In this case:
<?php
$file = 'masterfile.out';
// $pattern = '#^..([0-9]{11})#m'; // Any 11 digits
$pattern = '#^..(02354098780)#m'; // Exactly these 11
// the following line prevents the browser from parsing this
// as HTML.
header('Content-Type: text/plain');
// get the file contents, assuming the file to be readable (and exists)
$contents = file_get_contents($file);
if (preg_match_all($pattern, $contents, $matches, PREG_PATTERN_ORDER)){
echo "Found matches:\n";
echo implode("\n", $matches[1]);
echo "\n";
} else {
echo "No matches found\n";
}
Update
I've just noticed that your sequence starts on the third character when counting from 1. In some conventions (and in my earlier example) counting starts at 0. So if you count from 1, you need only two dots, not three. In other words, when you say "starting at position 3", you probably mean to skip the first two characters, while (as you can see from the other answers) almost everyone assumed you wanted to skip three characters.
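To make the counting explicit, here is a minimal sketch that converts the question's 1-based, inclusive positions into the 0-based offset and length that substr expects. The helper name fieldAt is hypothetical, and the sample string is just the start of the first line of the sample data.

```php
<?php
// Hypothetical helper: extract a fixed-width field given 1-based,
// inclusive start and end positions (the way the question counts them).
function fieldAt(string $line, int $startPos, int $endPos): string
{
    // substr() counts from 0, so position 3 becomes offset 2,
    // and positions 3..13 span 13 - 3 + 1 = 11 characters.
    return substr($line, $startPos - 1, $endPos - $startPos + 1);
}

$line = 'I0023540987805R01 ABC GHI'; // start of the first sample line
echo fieldAt($line, 3, 13), "\n";    // prints 02354098780
```

With this counting, "position 3 to 13" means skipping exactly two characters, which is what the two dots in the pattern do.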
CodePudding user response:
If your example is close to your intended use, you're essentially searching for an exact match of a substring, but using preg_match_all. Iterating over the lines should use less memory than loading the whole file, and a strict substring comparison for exact equality costs less CPU than preg_match_all.
So I would recommend doing that, reading the lines either with fgets or with stream_get_line, which might be slightly more performant (though that should not matter in most contexts).
This can be achieved as follows:
$searchString = 'someFixedString';
$posOffset = 2;
$matchLength = mb_strlen($searchString);
$filePath = '/some/file.path';
// read-only access is enough here
$fileHandle = @fopen($filePath, 'r');
$checkedLines = 0;
$matches = [];
$foundMatches = false;
// Depending on what you wish to output
$capturePosOffset = 0;
$captureLength = $matchLength + $posOffset + 3;
// 0 means the default chunk size (8192 bytes), which is fine
// if lines are no longer than 8192 bytes;
// otherwise set to a value above the byte-length of your lines
$maxBytesToReadPerLine = 0;
// if file line-terminator is as in PHP,
// otherwise set to file's line-terminator
$lineTerminator = PHP_EOL;
if ($fileHandle) {
    while (!feof($fileHandle)) {
        // or just use fgets, which requires no further arguments
        $line = stream_get_line($fileHandle, $maxBytesToReadPerLine, $lineTerminator);
        if ($line === false) {
            break; // nothing left to read
        }
        $checkedLines++;
        if (mb_substr($line, $posOffset, $matchLength) === $searchString) {
            $foundMatches = true;
            $matches[] = $line;
            // Or, if you want to capture a field with a fixed length
            // (modify the offset and length arguments above), use this instead:
            // $matches[] = mb_substr($line, $capturePosOffset, $captureLength);
        }
    }
    fclose($fileHandle);
}
if ($foundMatches) {
echo "Found " . count($matches) . " matches among $checkedLines lines:" . PHP_EOL;
foreach ($matches as $matchedValue) {
// I'm not sure what you intend to do here.
// - In your example code, it appears you
// implode the array, but then only output
// 11 characters of the first line starting at position 3.
// If you want the whole line, you can capture it above
// and echo it here.
// Or if you want, you can capture and output the first field
// by modifying $capturePosOffset and $captureLength
// by merely echoing the value (and a newline)
echo ' ' . $matchedValue . PHP_EOL;
}
} else {
echo "No matches found!" . PHP_EOL;
}
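For comparison, the fgets variant mentioned in the comments could look roughly like the sketch below. The function name findLinesByField is hypothetical, the comparison is byte-based (substr rather than mb_substr), and an in-memory stream stands in for the real file so the example runs on its own.

```php
<?php
// Hypothetical helper: return every line whose bytes starting at
// $offset (0-based) equal $needle, reading one line at a time.
function findLinesByField($handle, string $needle, int $offset): array
{
    $found = [];
    $length = strlen($needle);
    // fgets() returns false at end of file, which ends the loop.
    while (($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n"); // drop the line terminator
        if (substr($line, $offset, $length) === $needle) {
            $found[] = $line;
        }
    }
    return $found;
}

// Usage with an in-memory stream standing in for the real file
$handle = fopen('php://memory', 'w+');
fwrite($handle, "I0023540987805R01 ABC\n0000002032902R01 DEF\n");
rewind($handle);
print_r(findLinesByField($handle, '02354098780', 2));
fclose($handle);
```

Putting the read inside the loop condition avoids the extra iteration that a feof-based loop can run after the last line.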
We use mb_strlen and mb_substr in case the encoding allows for multi-byte characters; only if you know that this is definitely not the case can strlen and substr be safely used.
One ought not to get bogged down in premature optimization, but just as a note: which solution is the most optimal will depend heavily on the file-size and the match-length.
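To illustrate why the choice between the byte-based and the multi-byte functions matters, here is a small sketch. The sample string is made up for the illustration and is assumed to be UTF-8; mb_substr requires the mbstring extension.

```php
<?php
// "\u{00C4}" is "Ä", which takes two bytes in UTF-8 (PHP 7+ syntax).
$line = "\u{00C4}I02354098780";

// Byte-based: offset 2 lands right after the two bytes of "Ä".
$byBytes = substr($line, 2, 3);
// Character-based: offset 2 skips the characters "Ä" and "I".
$byChars = mb_substr($line, 2, 3, 'UTF-8');

var_dump($byBytes === 'I02'); // bool(true)
var_dump($byChars === '023'); // bool(true)
```

If the file is plain ASCII, as the sample data appears to be, both calls return the same result and the byte-based functions are sufficient.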
CodePudding user response:
The following regular expression skips the first 3 characters at the beginning of each line and captures the next 11:
https://regex101.com/r/MEaB67/1
/^.{3}(.{11})/gm
EDIT
Here is some sample PHP code to test the regular expression (PHP has no g flag; preg_match_all already returns every match, so only the m flag is needed):
<pre>
<?php
$pattern = '/^.{3}(.{11})/m';
$subject = '
I0023540987805R01 ABC GHI OLirrt 000000000000000100EA 0812160070451700 1098833 1990041300000001086000000000108600000000000996000000000032100000000000000000000000000000000000000000000000000000000000000000000006589000000000000000 P0012B
0000002032902R01 DEF JKL KLijuI 000000000000000100EA 0812160070451700 1029132 1997010800000002396000000000239600000120002326000000000000000000000000000000000000000000000000000000000000004560000000000000000000000000987600000000 A203SD
';
$matches = null;
preg_match_all($pattern, $subject, $matches);
var_dump($matches);
?>
</pre>
Fabio
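As a rough guide to reading the var_dump output, the sketch below shows how preg_match_all arranges its results with the default PREG_PATTERN_ORDER. The subject string is a made-up stand-in for the sample data.

```php
<?php
// Made-up two-line subject; each line has 3 filler characters,
// then an 11-character field, then the rest of the line.
$subject = "abc12345678901rest\nxyz10987654321rest";

preg_match_all('/^.{3}(.{11})/m', $subject, $matches);

// $matches[0] holds the full matches, including the 3 skipped characters.
print_r($matches[0]);
// $matches[1] holds only what the parentheses captured.
print_r($matches[1]);
```

So the values of interest are in $matches[1], not $matches[0].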
CodePudding user response:
Here is a slightly different approach from yours: since we are looking for a string in a specific part of the line, we can cut out just that part and compare it with the search string.
<?php
$text = "I0023540987805R01 ABC GHI OLirrt 000000000000000100EA 0812160070451700 1098833 1990041300000001086000000000108600000000000996000000000032100000000000000000000000000000000000000000000000000000000000000000000006589000000000000000 P0012B
0000002032902R01 DEF JKL KLijuI 000000000000000100EA 0812160070451700 1029132 1997010800000002396000000000239600000120002326000000000000000000000000000000000000000000000000000000000000004560000000000000000000000000987600000000 A203SD ";
$searchfor = '02354098780';
$matches = [];
$matchesLine = [];
echo '<pre>';
$txt = explode("\n", $text);
print_r($txt);
foreach ($txt as $key => $line) {
    // Characters 3 to 13 of the line (offset 2, length 11)
    $subbedString = substr($line, 2, 11);
    if ($subbedString === $searchfor) {
        $matches[$key] = $searchfor;
        $matchesLine[$key] = $line; // Save the whole line when a match is found.
        echo "Found in line : $key";
    }
}
echo '<pre>';
print_r($matches);
echo '<pre>';
print_r($matchesLine);
Will return:
Array
(
[0] => I0023540987805R01 ABC GHI OLirrt 000000000000000100EA 0812160070451700 1098833 1990041300000001086000000000108600000000000996000000000032100000000000000000000000000000000000000000000000000000000000000000000006589000000000000000 P0012B
[1] => 0000002032902R01 DEF JKL KLijuI 000000000000000100EA 0812160070451700 1029132 1997010800000002396000000000239600000120002326000000000000000000000000000000000000000000000000000000000000004560000000000000000000000000987600000000 A203SD
)
Found in line : 0
Array
(
[0] => 02354098780
)
Array
(
[0] => I0023540987805R01 ABC GHI OLirrt 000000000000000100EA 0812160070451700 1098833 1990041300000001086000000000108600000000000996000000000032100000000000000000000000000000000000000000000000000000000000000000000006589000000000000000 P0012B
)
CodePudding user response:
You can match 3 characters, then use \K to forget what has been matched so far, and then match 11 digits.
^...\K\d{11}
^ : Start of string
... : Match any character except a newline, 3 times
\K : Clear the current match buffer
\d{11} : Match 11 digits
You can omit preg_quote, as there is nothing to escape in the current pattern.
As the pattern uses the anchor ^, you have to specify the multiline flag /m to get all the results.
$file = 'masterfile.out';
$contents = file_get_contents($file);
$pattern = "/^...\K\d{11}/m";
if (preg_match_all($pattern, $contents, $matches)) {
echo "Found matches:" . PHP_EOL;
foreach ($matches[0] as $m) {
echo $m . PHP_EOL;
}
} else {
echo "No matches found";
}
Output
Found matches:
23540987805
00002032902
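If the field really starts at the third character, as the update to the first answer suggests, the same \K idea with only two skipped characters might look like the sketch below. The $contents string is a shortened stand-in for the file, built from the first few characters of each sample line.

```php
<?php
// Shortened stand-in for the contents of masterfile.out
$contents = "I0023540987805R01 ABC\n0000002032902R01 DEF\n";

// Skip 2 characters, reset the match start with \K, then take 11 digits.
preg_match_all('/^.{2}\K\d{11}/m', $contents, $matches);

foreach ($matches[0] as $m) {
    echo $m . PHP_EOL;
}
```

Because of \K, the skipped characters are not part of $matches[0], so no capture group is needed. To look for one specific number, \d{11} can be replaced by the literal digits.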