PHP: Matching lines between files

Time:08-28

Text File 1:

426684146543xxxx|xx|xxxx|xxx
407166210197xxxx|xx|xxxx|xxx
521307101305xxxx|xx|xxxx|xxx
521307101485xxxx|xx|xxxx|xxx

Text File 2:

521307
407166

If a line in the 2nd text file matches the start of a line in the 1st text file, I want to show all of the matching lines from the 1st file.

OUTPUT:

521307101485xxxx
521307101305xxxx
407166210197xxxx

CodePudding user response:

You can try using the preg_match_all function to find all matching lines:


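    $file1 = file_get_contents('text1.txt');
    // Split on any newline style and skip blank lines
    $file2 = preg_split('/\R/', file_get_contents('text2.txt'), -1, PREG_SPLIT_NO_EMPTY);
    $result = [];
    foreach ($file2 as $item) {
        // Match every line of file 1 that starts with this prefix
        preg_match_all('#^' . preg_quote($item, '#') . '[^\r\n]*#m', $file1, $matches);

        $result[] = $matches;
    }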

Result:

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => 521307101305xxxx|xx|xxxx|xxx
                    [1] => 521307101485xxxx|xx|xxxx|xxx
                )

        )

    [1] => Array
        (
            [0] => Array
                (
                    [0] => 407166210197xxxx|xx|xxxx|xxx
                )

        )

)

However, I don't think using a regular expression to find these strings is the best solution.
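
As a simpler alternative to the regular-expression approach, a plain prefix comparison may be enough. The sketch below is only an illustration: the function name matchLinesByPrefix is hypothetical, and it assumes the second file holds one prefix per line and is small enough to keep in memory, while the first file is streamed line by line.

```php
<?php
// Hypothetical helper, not part of the original answer.
function matchLinesByPrefix(string $dataPath, string $prefixPath): array
{
    $prefixLines = file($prefixPath, FILE_IGNORE_NEW_LINES);
    if ($prefixLines === false) {
        throw new RuntimeException('Could not read file ' . $prefixPath);
    }

    // Keep non-empty prefixes; trim() also removes a trailing "\r".
    $prefixes = [];
    foreach ($prefixLines as $line) {
        $line = trim($line);
        if ($line !== '') {
            $prefixes[] = $line;
        }
    }

    $fh = fopen($dataPath, 'r');
    if ($fh === false) {
        throw new RuntimeException('Could not open file ' . $dataPath);
    }

    $matches = [];
    // Stream the data file one line at a time.
    while (($line = fgets($fh)) !== false) {
        $line = rtrim($line, "\r\n");
        foreach ($prefixes as $prefix) {
            // strncmp() returns 0 when the first strlen($prefix) bytes are equal.
            if (strncmp($line, $prefix, strlen($prefix)) === 0) {
                $matches[] = $line;
                break; // do not add the same line twice
            }
        }
    }
    fclose($fh);

    return $matches;
}
```

Called as matchLinesByPrefix('text1.txt', 'text2.txt'), this returns the matching lines in the order they appear in the first file. It checks every prefix against every line, so it is best suited to a short prefix list.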

CodePudding user response:

This can be a tricky problem to solve. If you are dealing with large files, or don't know how large your files will be, you have to find a way to solve the problem without reading either file into memory all at once.

In order to do that, you need to create some sort of efficient structure for the data in file 1 that you can search for each ID in file 2, that also allows you to retrieve the full records for file 1 after you have determined the matches. This is exactly what trees are made for.

Here is a solution that reads in the data in file 1, creates a tree structure from the first column of each row, and keeps track of the byte offsets from the file where the strings appear. This allows you to search using any length of ID prefix (searching with "4" would return the first two lines, "40" only the second).
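
The byte-offset bookkeeping on its own can be sketched as follows. This is a minimal illustration, separate from the classes in this answer: the function names indexLineOffsets and readLineAt are hypothetical. It records where each line starts using ftell, then jumps back to a chosen line with fseek instead of keeping the line text in memory.

```php
<?php
// Hypothetical helpers for illustration; the names are not from the answer.

/** Return the byte offset at which each line of the file starts. */
function indexLineOffsets(string $path): array
{
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException('Could not open file ' . $path);
    }

    $offsets = [];
    $offset  = 0;
    while (fgets($fh) !== false) {
        $offsets[] = $offset;     // where the line just read began
        $offset    = ftell($fh);  // where the next line will begin
    }
    fclose($fh);

    return $offsets;
}

/** Seek to a recorded offset and read that single line back. */
function readLineAt(string $path, int $offset): string
{
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException('Could not open file ' . $path);
    }

    fseek($fh, $offset);
    $line = fgets($fh);
    fclose($fh);

    return $line === false ? '' : rtrim($line, "\r\n");
}
```

Only integers are stored per line, which is why the tree in this answer can attach offsets to its nodes and fetch full records later.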

The solution uses two classes: CharNode, which represents a single node in the tree, and IdTree, which manages the structure of nodes and handles ingesting the files and searching.

<?php

class CharNode implements JsonSerializable
{
    private string $char;
    private array  $byteOffsets = [];
    private array  $children    = [];
    
    /**
     * CharNode constructor.
     * @param string $char
     */
    public function __construct(string $char)
    {
        $this->char = $char;
    }
    
    /**
     * @return string
     */
    public function getChar(): string
    {
        return $this->char;
    }
    
    /**
     * @param string $char
     */
    public function setChar(string $char): void
    {
        $this->char = $char;
    }
    
    /**
     * @return array
     */
    public function getChildren(): array
    {
        return $this->children;
    }
    
    /**
     * @param array $children
     */
    public function setChildren(array $children): void
    {
        $this->children = $children;
    }
    
    /**
     * @return array
     */
    public function getByteOffsets(): array
    {
        return $this->byteOffsets;
    }
    
    /**
     * @param array $byteOffsets
     */
    public function setByteOffsets(array $byteOffsets): void
    {
        $this->byteOffsets = $byteOffsets;
    }
    
    /**
     * @param int $byteOffset
     * @return void
     */
    public function addByteOffset(int $byteOffset): void
    {
        $this->byteOffsets[] = $byteOffset;
    }
    
    /**
     * @param array $charVector
     * @param int $byteOffset
     * @return void
     */
    public function ingestCharVector(array $charVector, int $byteOffset)
    {
        $char = array_shift($charVector);
        
        if (!array_key_exists($char, $this->children))
        {
            $newNode               = new CharNode($char);
            $this->children[$char] = $newNode;
        }
        
        $currChild = $this->children[$char];
        $currChild->addByteOffset($byteOffset);
        
        if (!empty($charVector))
        {
            $currChild->ingestCharVector($charVector, $byteOffset);
        }
    }
    
    public function jsonSerialize(): array
    {
        return [
            'char'        => $this->char,
            'byteOffsets' => $this->byteOffsets,
            'children'    => array_values($this->children)
        ];
    }
}

class IdTree implements JsonSerializable
{
    private array  $tree             = [];
    private array  $byteOffsetOutput = [];
    private string $filePath;
    
    /**
     * @param string $filePath
     * @param string $delimiter
     * @throws Exception
     */
    public function __construct(string $filePath, string $delimiter = '|')
    {
        $this->filePath = $filePath;
        
        $fh = fopen($filePath, 'r');
        if (!$fh)
        {
            throw new Exception('Could not open file ' . $filePath);
        }
        
        $currByteOffset = 0;
        while (($currRow = fgetcsv($fh, null, $delimiter)) !== false)
        {
            // fgetcsv() returns [null] for a blank line; skip those
            if ($currRow[0] !== null)
            {
                $this->ingestWord($currRow[0], $currByteOffset);
            }
            $currByteOffset = ftell($fh);
        }
        
        fclose($fh);
    }
    
    /**
     * @param string $idFilePath
     * @return array
     * @throws Exception
     */
    public function getByteOffsetsForIdFile(string $idFilePath): array
    {
        $byteOffsets = [];
        $fh          = fopen($idFilePath, 'r');
        
        if (!$fh)
        {
            throw new Exception('Could not open file ' . $idFilePath);
        }
        
        while (($currLine = fgets($fh)) !== false)
        {
            $currByteOffsets = $this->findByteOffsetsForId(trim($currLine));
            $byteOffsets     = array_merge($byteOffsets, $currByteOffsets);
        }
        
        fclose($fh);
        
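        // Overlapping prefixes can yield the same offset twice; keep each once, in file order
        $byteOffsets = array_unique($byteOffsets);
        sort($byteOffsets);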
        
        return $byteOffsets;
    }
    
    /**
     * @param string $idFilePath
     * @param bool $firstColumnOnly
     * @return array
     * @throws Exception
     */
    public function getLinesMatchingIdFile(string $idFilePath, bool $firstColumnOnly = false): array
    {
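        $byteOffsets = $this->getByteOffsetsForIdFile($idFilePath);
        $fh          = fopen($this->filePath, 'r');
        
        if (!$fh)
        {
            throw new Exception('Could not open file ' . $this->filePath);
        }
        
        $output = [];
        foreach ($byteOffsets as $currOffset)
        {
            // Jump straight to the start of the matching line and re-read it
            // (this assumes the default '|' delimiter)
            fseek($fh, $currOffset);
            $currRow = fgetcsv($fh, null, '|');
            
            $output[] = ($firstColumnOnly) ? $currRow[0] : $currRow;
        }
        
        fclose($fh);
        
        return $output;
    }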
    
    public function ingestWord(string $word, int $byteOffset): void
    {
        $word = $this->formatWord($word);
        
        if (empty($word))
        {
            return;
        }
        
        $charVector = str_split($word, 1);
        
        $this->ingestCharVector($charVector, $byteOffset);
    }
    
    /**
     * @param array $charVector
     * @param int $byteOffset
     * @return void
     */
    public function ingestCharVector(array $charVector, int $byteOffset): void
    {
        $char = array_shift($charVector);
        if (!array_key_exists($char, $this->tree))
        {
            $this->tree[$char] = new CharNode($char);
        }
        
        $currChild = $this->tree[$char];
        // Record the offset on the top-level node too, so one-character prefixes match
        $currChild->addByteOffset($byteOffset);
        
        if (!empty($charVector))
        {
            $currChild->ingestCharVector($charVector, $byteOffset);
        }
    }
    
    /**
     * @param string $term
     * @return array
     */
    public function findByteOffsetsForId(string $term): array
    {
        // Reset state
        $this->byteOffsetOutput = [];
        
        $word = $this->formatWord($term);
        
        if (empty($word))
        {
            return [];
        }
        
        $charVector = str_split($word, 1);
        
        $this->branchSearch($charVector, $this->tree);
        
        return $this->byteOffsetOutput;
    }
    
    /**
     * @param array $charVector
     * @param array $charNodeSet
     * @return void
     */
    private function branchSearch(array $charVector, array $charNodeSet): void
    {
        if (empty($charNodeSet))
        {
            return;
        }
        
        if (!empty($charVector))
        {
            $currChar = array_shift($charVector);
            if (!array_key_exists($currChar, $charNodeSet))
            {
                return;
            }
            
            /**
             * @var $currCharNode CharNode
             */
            $currCharNode = $charNodeSet[$currChar];
            
            // If this is the end of the search term, collect the byte offsets
            if (empty($charVector))
            {
                $this->byteOffsetOutput = array_merge($this->byteOffsetOutput, $currCharNode->getByteOffsets());
            }
            
            $this->branchSearch($charVector, $currCharNode->getChildren());
        }
    }
    
    /**
     * @param string $word
     * @return string|null
     */
    private function formatWord(string $word)
    {
        $word = strtolower($word);

        $word = preg_replace("/[^a-z0-9 ]/", '', $word);
        
        return $word;
    }
    
    public function jsonSerialize(): array
    {
        return array_values($this->tree);
    }
}

That looks like a lot of code, but most of it is fairly idiomatic tree logic.

Using it is dead simple:

// Instantiate and load our tree
$tree = new IdTree('file1.txt');

// Get all matching rows
$matchingRows = $tree->getLinesMatchingIdFile('file2.txt');
print_r($matchingRows);

Output:

Array
(
    [0] => Array
        (
            [0] => 407166210197xxxx
            [1] => xx
            [2] => xxxx
            [3] => xxx
        )

    [1] => Array
        (
            [0] => 521307101305xxxx
            [1] => xx
            [2] => xxxx
            [3] => xxx
        )

    [2] => Array
        (
            [0] => 521307101485xxxx
            [1] => xx
            [2] => xxxx
            [3] => xxx
        )

)

It was not clear to me whether you wanted the entire row for each match or just the first column, so I added a flag to choose between them.

// Get only the first column of each line
$matchingIds = $tree->getLinesMatchingIdFile('file2.txt', true);
print_r($matchingIds);

Output:

Array
(
    [0] => 407166210197xxxx
    [1] => 521307101305xxxx
    [2] => 521307101485xxxx
)

There is some extra stuff you may not need, like the JSON output, which is useful for visualizing how the structure works. You could also make this more efficient if you know your data will always be formatted in certain ways (if your search IDs will always be within certain lengths, etc). You could still run into memory problems if you are processing truly massive data files. This is just a basic example of how you can go about solving problems like this "for real".
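
As one possible example of such a simplification: if every search ID is known to have the same length, a flat array keyed by that fixed-length prefix could replace the tree. The sketch below rests on that assumption (six characters, as in the sample data), and the function name indexByFixedPrefix is hypothetical.

```php
<?php
// Hypothetical function; assumes every search ID is exactly $prefixLength characters long.
function indexByFixedPrefix(string $path, int $prefixLength = 6): array
{
    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException('Could not open file ' . $path);
    }

    $index  = [];
    $offset = 0;
    while (($line = fgets($fh)) !== false) {
        $line = rtrim($line, "\r\n");
        if (strlen($line) >= $prefixLength) {
            // Group the starting byte offsets of all lines sharing this prefix.
            $index[substr($line, 0, $prefixLength)][] = $offset;
        }
        $offset = ftell($fh); // start of the next line
    }
    fclose($fh);

    return $index;
}
```

A lookup is then a single array access, for example $index['521307'] ?? [], and the returned offsets can be passed to fseek to read the full rows. The index still holds one entry per line, so very large files may need a different approach.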
