PHP: matching lines between files


Text File 1:


Text File 2:


If a line in the 2nd text file exists in the 1st text file, I want to show all the matching lines from the 1st file.



CodePudding user response:

You can try using the preg_match_all function to find all matching strings:

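    $file1 = file_get_contents('text1.txt');
    $file2 = file_get_contents('text2.txt');
    // Split the search file into lines, tolerating both \r\n and \n endings
    $file2 = preg_split('/\R/', $file2, -1, PREG_SPLIT_NO_EMPTY);
    $result = [];
    foreach ($file2 as $item) {
        // Escape the search string, then match from it to the end of the line
        preg_match_all('#' . preg_quote($item, '#') . '.*#', $file1, $matches);
        $result[] = $matches;
    }
    print_r($result);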


Result:

    Array
    (
        [0] => Array
            (
                [0] => Array
                    (
                        [0] => 521307101305xxxx|xx|xxxx|xxx
                        [1] => 521307101485xxxx|xx|xxxx|xxx
                    )

            )

        [1] => Array
            (
                [0] => Array
                    (
                        [0] => 407166210197xxxx|xx|xxxx|xxx
                    )

            )

    )


That said, I don't think using preg_match to find a string is the best solution here.
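
A plain string comparison might be simpler. Here is a minimal sketch, assuming each line of the 2nd file is meant to match the start of a line in the 1st file (which is what the output above suggests). The function name linesStartingWithAny is made up for illustration, and both files are still read into memory, so this only suits small inputs:

```php
<?php
/**
 * Return the lines of $haystackPath that start with any prefix listed in $prefixPath.
 * Hypothetical helper: the name and the "prefix" assumption are illustrative only.
 */
function linesStartingWithAny(string $haystackPath, string $prefixPath): array
{
    // Read the prefixes, trim line endings and drop blank lines
    $prefixes = array_filter(
        array_map('trim', file($prefixPath, FILE_IGNORE_NEW_LINES)),
        static fn(string $p): bool => $p !== ''
    );

    $matches = [];
    foreach (file($haystackPath, FILE_IGNORE_NEW_LINES) as $line) {
        $line = rtrim($line, "\r");
        foreach ($prefixes as $prefix) {
            // strncmp compares only the first strlen($prefix) bytes
            if (strncmp($line, $prefix, strlen($prefix)) === 0) {
                $matches[] = $line;
                break; // do not add the same line twice
            }
        }
    }
    return $matches;
}
```

Calling linesStartingWithAny('text1.txt', 'text2.txt') would return the matching lines of text1.txt in file order.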

CodePudding user response:

This can be a tricky problem to solve. If you are dealing with large files, or don't know how large your files will be, you have to find a way to solve the problem without reading either file into memory all at once.
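
As a rough illustration of what "not all at once" means in PHP, a generator can hand back one line at a time so that only the current line is held in memory. This is a minimal sketch; the function name readLines is hypothetical:

```php
<?php
/**
 * Yield the lines of a file one at a time, without line endings.
 * Only the current line is kept in memory.
 */
function readLines(string $path): Generator
{
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException('Could not open file ' . $path);
    }
    try {
        // Compare against false so a line containing just "0" is not skipped
        while (($line = fgets($fh)) !== false) {
            yield rtrim($line, "\r\n");
        }
    } finally {
        // Runs when the loop ends or the generator is destroyed early
        fclose($fh);
    }
}
```

You would consume it with foreach (readLines('file2.txt') as $id) { ... }.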

In order to do that, you need to build an efficient structure from the data in file 1 that you can search for each ID in file 2, and that also lets you retrieve the full records from file 1 once you have determined the matches. This is exactly what prefix trees (tries) are made for.
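
To make the idea concrete, here is a stripped-down sketch of a prefix tree built from nested arrays. The function names, the IDs and the integers stored with them are all made up for illustration; the integers stand in for whatever you need to find the full record again, such as a position in the file:

```php
<?php
// Insert $id into the trie, recording $offset on every node along the path
// so that any prefix of the ID can find it later.
function trieInsert(array &$trie, string $id, int $offset): void
{
    $node = &$trie;
    foreach (str_split($id) as $char) {
        $node['children'][$char]['offsets'][] = $offset;
        $node = &$node['children'][$char];
    }
}

// Return the offsets of every ID that starts with $prefix.
function trieSearch(array $trie, string $prefix): array
{
    $node = $trie;
    foreach (str_split($prefix) as $char) {
        if (!isset($node['children'][$char])) {
            return []; // no stored ID continues with this character
        }
        $node = $node['children'][$char];
    }
    return $node['offsets'] ?? [];
}

$trie = [];
trieInsert($trie, '4123', 0);  // hypothetical IDs and offsets
trieInsert($trie, '4055', 20);
trieInsert($trie, '5213', 40);

print_r(trieSearch($trie, '4'));  // offsets 0 and 20
print_r(trieSearch($trie, '40')); // offset 20 only
```

Each lookup walks one node per character of the search term, so its cost depends on the length of the ID rather than on the number of rows in file 1.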

Here is a solution that reads in the data in file 1, creates a tree structure from the first column of each row, and keeps track of the byte offsets from the file where the strings appear. This allows you to search using any length of ID prefix (searching with "4" would return the first two lines, "40" only the second).

There are two classes: CharNode, which represents a single node in the tree, and IdTree, which manages the structure of nodes and handles ingesting the files and searching.


class CharNode implements JsonSerializable
{
    private string $char;
    private array  $byteOffsets = [];
    private array  $children    = [];

    /**
     * CharNode constructor.
     * @param string $char
     */
    public function __construct(string $char)
    {
        $this->char = $char;
    }

    /** @return string */
    public function getChar(): string
    {
        return $this->char;
    }

    /** @param string $char */
    public function setChar(string $char): void
    {
        $this->char = $char;
    }

    /** @return array */
    public function getChildren(): array
    {
        return $this->children;
    }

    /** @param array $children */
    public function setChildren(array $children): void
    {
        $this->children = $children;
    }

    /** @return array */
    public function getByteOffsets(): array
    {
        return $this->byteOffsets;
    }

    /** @param array $byteOffsets */
    public function setByteOffsets(array $byteOffsets): void
    {
        $this->byteOffsets = $byteOffsets;
    }

    /**
     * @param int $byteOffset
     * @return void
     */
    public function addByteOffset(int $byteOffset): void
    {
        $this->byteOffsets[] = $byteOffset;
    }

    /**
     * @param array $charVector
     * @param int $byteOffset
     * @return void
     */
    public function ingestCharVector(array $charVector, int $byteOffset): void
    {
        $char = array_shift($charVector);
        if (!array_key_exists($char, $this->children)) {
            $this->children[$char] = new CharNode($char);
        }
        $currChild = $this->children[$char];
        // Record the offset on every node along the path so prefix searches work
        $currChild->addByteOffset($byteOffset);
        if (!empty($charVector)) {
            $currChild->ingestCharVector($charVector, $byteOffset);
        }
    }

    public function jsonSerialize(): array
    {
        return [
            'char'        => $this->char,
            'byteOffsets' => $this->byteOffsets,
            'children'    => array_values($this->children),
        ];
    }
}

class IdTree implements JsonSerializable
{
    private array  $tree             = [];
    private array  $byteOffsetOutput = [];
    private string $filePath;
    private string $delimiter;

    /**
     * @param string $filePath
     * @param string $delimiter
     * @throws Exception
     */
    public function __construct(string $filePath, string $delimiter = '|')
    {
        $this->filePath  = $filePath;
        $this->delimiter = $delimiter;
        $fh = fopen($filePath, 'r');
        if (!$fh) {
            throw new Exception('Could not open file ' . $filePath);
        }
        $currByteOffset = 0;
        while (($currRow = fgetcsv($fh, null, $delimiter)) !== false) {
            // Blank lines come back as [null]; the cast turns that into '' so they are skipped
            $this->ingestWord((string) $currRow[0], $currByteOffset);
            $currByteOffset = ftell($fh);
        }
        fclose($fh);
    }

    /**
     * @param string $idFilePath
     * @return array
     * @throws Exception
     */
    public function getByteOffsetsForIdFile(string $idFilePath): array
    {
        $byteOffsets = [];
        $fh          = fopen($idFilePath, 'r');
        if (!$fh) {
            throw new Exception('Could not open file ' . $idFilePath);
        }
        while (($currLine = fgets($fh)) !== false) {
            $currByteOffsets = $this->findByteOffsetsForId(trim($currLine));
            $byteOffsets     = array_merge($byteOffsets, $currByteOffsets);
        }
        fclose($fh);
        return $byteOffsets;
    }

    /**
     * @param string $idFilePath
     * @param bool $firstColumnOnly
     * @return array
     * @throws Exception
     */
    public function getLinesMatchingIdFile(string $idFilePath, bool $firstColumnOnly = false): array
    {
        $byteOffsets = $this->getByteOffsetsForIdFile($idFilePath);
        $fh          = fopen($this->filePath, 'r');
        if (!$fh) {
            throw new Exception('Could not open file ' . $this->filePath);
        }
        $output = [];
        foreach ($byteOffsets as $currOffset) {
            // Jump straight to the start of the matching row and re-read it
            fseek($fh, $currOffset);
            $currRow  = fgetcsv($fh, null, $this->delimiter);
            $output[] = ($firstColumnOnly) ? $currRow[0] : $currRow;
        }
        fclose($fh);
        return $output;
    }

    public function ingestWord(string $word, int $byteOffset): void
    {
        $word = $this->formatWord($word);
        // Compare against '' because empty('0') is true in PHP
        if ($word === '') {
            return;
        }
        $charVector = str_split($word, 1);
        $this->ingestCharVector($charVector, $byteOffset);
    }

    /**
     * @param array $charVector
     * @param int $byteOffset
     * @return void
     */
    public function ingestCharVector(array $charVector, int $byteOffset): void
    {
        $char = array_shift($charVector);
        if (!array_key_exists($char, $this->tree)) {
            $this->tree[$char] = new CharNode($char);
        }
        $currChild = $this->tree[$char];
        // Record the offset on every node along the path so prefix searches work
        $currChild->addByteOffset($byteOffset);
        if (!empty($charVector)) {
            $currChild->ingestCharVector($charVector, $byteOffset);
        }
    }

    /**
     * @param string $term
     * @return array
     */
    public function findByteOffsetsForId(string $term): array
    {
        // Reset state
        $this->byteOffsetOutput = [];
        $word = $this->formatWord($term);
        if ($word === '') {
            return [];
        }
        $charVector = str_split($word, 1);
        $this->branchSearch($charVector, $this->tree);
        return $this->byteOffsetOutput;
    }

    /**
     * @param array $charVector
     * @param array $charNodeSet
     * @return void
     */
    private function branchSearch(array $charVector, array $charNodeSet): void
    {
        if (empty($charNodeSet) || empty($charVector)) {
            return;
        }
        $currChar = array_shift($charVector);
        if (!array_key_exists($currChar, $charNodeSet)) {
            return;
        }
        /** @var CharNode $currCharNode */
        $currCharNode = $charNodeSet[$currChar];
        // If this is the end of the search term, collect the byte offsets
        if (empty($charVector)) {
            $this->byteOffsetOutput = array_merge($this->byteOffsetOutput, $currCharNode->getByteOffsets());
        }
        $this->branchSearch($charVector, $currCharNode->getChildren());
    }

    /**
     * @param string $word
     * @return string
     */
    private function formatWord(string $word): string
    {
        $word = strtolower($word);
        // Keep only lowercase letters, digits and spaces
        return (string) preg_replace("/[^a-z0-9 ]/", '', $word);
    }

    public function jsonSerialize(): array
    {
        return array_values($this->tree);
    }
}

That looks like a lot of code, but most of it is fairly idiomatic tree logic.

Using it is dead simple:

// Instantiate and load our tree
$tree = new IdTree('file1.txt');

// Get all matching rows
$matchingRows = $tree->getLinesMatchingIdFile('file2.txt');


    Array
    (
        [0] => Array
            (
                [0] => 407166210197xxxx
                [1] => xx
                [2] => xxxx
                [3] => xxx
            )

        [1] => Array
            (
                [0] => 521307101305xxxx
                [1] => xx
                [2] => xxxx
                [3] => xxx
            )

        [2] => Array
            (
                [0] => 521307101485xxxx
                [1] => xx
                [2] => xxxx
                [3] => xxx
            )

    )


It was not clear to me whether you wanted the entire row for each match or just the first column, so I added a flag that lets you choose.

// Get only the first column of each line
$matchingIds = $tree->getLinesMatchingIdFile('file2.txt', true);


    Array
    (
        [0] => 407166210197xxxx
        [1] => 521307101305xxxx
        [2] => 521307101485xxxx
    )

There is some extra stuff you may not need, like the JSON output, which is useful for visualizing how the structure works. You could also make this more efficient if you know your data will always be formatted in certain ways (if your search IDs will always be within certain lengths, etc). You could still run into memory problems if you are processing truly massive data files. This is just a basic example of how you can go about solving problems like this "for real".
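
As one example of that kind of shortcut: if every search ID had the same known length (an assumption, not something your sample data guarantees), a flat array keyed by that fixed-length prefix could stand in for the tree. This is only a sketch; buildPrefixIndex, $prefixLength and the '|' default are names and values I made up for illustration:

```php
<?php
/**
 * Map each fixed-length ID prefix in $path to the byte offsets of its rows.
 * Assumes every search ID is exactly $prefixLength characters long.
 */
function buildPrefixIndex(string $path, int $prefixLength, string $delimiter = '|'): array
{
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException('Could not open file ' . $path);
    }

    $index  = [];
    $offset = 0;
    while (($line = fgets($fh)) !== false) {
        // The ID is everything before the first delimiter
        $id = explode($delimiter, rtrim($line, "\r\n"), 2)[0];
        if (strlen($id) >= $prefixLength) {
            $index[substr($id, 0, $prefixLength)][] = $offset;
        }
        // Remember where the next row starts
        $offset = ftell($fh);
    }
    fclose($fh);
    return $index;
}
```

A lookup is then a single array access, for example $index['52130710'] ?? [] with a prefix length of 8, and the offsets can be passed to fseek in the same way getLinesMatchingIdFile does. The trade-off is that it only answers searches of exactly that length.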
