I have a text file with over 250 million lines. Each line has a 3-digit area code followed by a comma and a 7-digit number.
Sample Input File:
201,2220000
201,5551212
310,5552481
376,1239876
443,0002222
572,8880099
...
I would like to generate an output file which lists each unique area code and the number of occurrences of that area code (only looking at the first 3 characters of each line).
Example output (area code, count):
201, 44556
202, 34529
...
I am working in a Windows 10 environment.
After considerable research, I was able to use the Switch function with regex in PowerShell to achieve something very close. The problem with this solution is that I need to know which area codes I am looking for (and I don't know all the area codes listed in this file).
I would like to modify the solution so that it first finds all unique area codes and then counts them.
Here's what I have tried:
- Say, I want to search for the following four area codes: 201,202,203,205
- My text file is datafile.txt
$count1 = 0
$count2 = 0
$count3 = 0
$count4 = 0
switch -File C:\datafile.txt -Regex { '201\S{8}' { $count1++ } }
Write-Output "Area Code 201: $($count1)" | Format-Table | Out-File "C:\summary.txt" -Append
switch -File C:\datafile.txt -Regex { '202\S{8}' { $count2++ } }
Write-Output "Area Code 202: $($count2)" | Format-Table | Out-File "C:\summary.txt" -Append
switch -File C:\datafile.txt -Regex { '203\S{8}' { $count3++ } }
Write-Output "Area Code 203: $($count3)" | Format-Table | Out-File "C:\summary.txt" -Append
switch -File C:\datafile.txt -Regex { '205\S{8}' { $count4++ } }
Write-Output "Area Code 205: $($count4)" | Format-Table | Out-File "C:\summary.txt" -Append
This code generates the file summary.txt and appends the counts to the area codes. However, I think this is inefficient as:
- I need to know all the area codes that are in this datafile.
- I have to add 3 lines of code for every additional area code.
Would appreciate any help improving this code, or suggestions for an alternate solution. (I found a thread on Stack Overflow that uses grep - https://stackoverflow.com/questions/61229157/using-regex-in-grep-for-windows-command-line - but it has the same limitation: you need to know what string you are searching for.)
CodePudding user response:
Assuming I understood correctly, there is no regex needed here, just .Substring(0, 3) to get the first 3 characters of each line and a hashtable to track the unique codes and their counts efficiently. Indeed, switch -File is awesome for this task and should be used to read your file. Otherwise, for simplicity while staying efficient, you could use File.ReadLines.
$map = @{ }
switch -File path\to\source\file.txt {
    Default {
        # increment the running count for this line's 3-character area code
        $map[$_.Substring(0, 3)]++
    }
}
$map.GetEnumerator() | ForEach-Object {
    [pscustomobject]@{
        Code  = $_.Key
        Count = $_.Value
    }
} | Export-Csv path\to\resultOfUniqueCodes.csv -NoTypeInformation
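For reference, a minimal sketch of the same hashtable counting loop built on File.ReadLines instead of switch -File (the paths are taken from the question; adjust them to your setup):
$map = @{ }
foreach ( $line in [System.IO.File]::ReadLines( 'C:\datafile.txt' ) ) {
    # ReadLines streams the file lazily, so even 250M lines are never held in memory at once
    $map[ $line.Substring( 0, 3 ) ]++
}
# emit simple "code,count" lines; sorting by area code is optional
$map.GetEnumerator() |
    Sort-Object Key |
    ForEach-Object { '{0},{1}' -f $_.Key, $_.Value } |
    Set-Content 'C:\summary.csv'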
CodePudding user response:
Try the following:
$input = @"
area,number
201,44556
202,34529
201,44556
202,34529
201,44556
202,34529
201,44556
202,34529
"@
$table = $input | ConvertFrom-Csv
$table | Format-Table
$groups = $table | Group-Object {$_.area}
$outputTable = [System.Collections.ArrayList]::new()
foreach($group in $groups)
{
$group | Format-Table
$newRow = New-Object -TypeName psobject
$newRow | Add-Member -NotePropertyName area -NotePropertyValue $group.Name
$newRow | Add-Member -NotePropertyName count -NotePropertyValue $group.Count
$outputTable.Add($newRow) | Out-Null
}
$outputTable | Format-Table
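For what it's worth, the same summary table can also be built in a single pipeline with calculated properties instead of the per-row Add-Member calls - a small sketch reusing the $table from above:
$table | Group-Object area |
    Select-Object @{ Name = 'area'; Expression = { $_.Name } },
                  @{ Name = 'count'; Expression = { $_.Count } } |
    Format-Table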
CodePudding user response:
Just my two cents for a streaming approach - trying to avoid everything that's comparatively slow, like ForEach-Object
, pscustomobject
and Export-Csv
.
# Create a scriptblock to be able to pipe output of foreach loop
& {
    foreach ( $line in [IO.File]::ReadLines( 'input.txt' ) ) {
        $line.Substring( 0, 3 )
    }
} | Group-Object -NoElement | & {
    begin {
        'Code,Count'
    }
    process {
        '{0},{1}' -f $_.Name, $_.Count
    }
} | Set-Content output.csv
Remarks:
- foreach( $line in [IO.File]::ReadLines( 'input.txt' ) ) processes the input file lazily, so it's not read into memory as a whole. This works because ReadLines returns an iterator (not a collection), which foreach understands. As mentioned by others, ReadLines is considered one of the fastest ways for line-by-line processing of text files, while still providing ease-of-use (compared to using .NET streams, for instance).
- Group-Object -NoElement just counts the number of occurrences of unique input elements, probably using an internal hashtable, so it should be about as fast as a manually created hashtable (not measured though - it would be really interesting; see the benchmark sketch after this list).
- Piping from Group-Object to a script block is much faster than ForEach-Object with a script block, see GitHub issue. Though in your case it doesn't matter much, as the bottleneck will be reading and processing the input file.
- As the kind of input data is known, we can avoid Export-Csv's complexities (like escaping rules) and create the CSV directly using simple string operations and Set-Content. Again, this won't really make a difference here, but might still be good to know for other cases that are more output-heavy.
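Picking up on that open question, here is a rough benchmark sketch comparing Group-Object -NoElement against a manual hashtable. The input path is an assumption - point $file at your real data - and absolute timings will of course vary by machine:
$file = 'input.txt'   # assumed input path - replace with your actual file

$viaGroupObject = Measure-Command {
    & { foreach ( $line in [IO.File]::ReadLines( $file ) ) { $line.Substring( 0, 3 ) } } |
        Group-Object -NoElement
}

$viaHashtable = Measure-Command {
    $map = @{ }
    foreach ( $line in [IO.File]::ReadLines( $file ) ) {
        $map[ $line.Substring( 0, 3 ) ]++
    }
}

'Group-Object: {0:n2} s' -f $viaGroupObject.TotalSeconds
'Hashtable:    {0:n2} s' -f $viaHashtable.TotalSeconds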