Find all unique occurrences of first 3 char, count number of occurrences and write output to a file


I have a text file with over 250 million lines. Each line has a 3-digit area code followed by a comma and a 7-digit number.

Sample Input File:
201,2220000
201,5551212
310,5552481
376,1239876
443,0002222
572,8880099
...

I would like to generate an output file which lists each unique area code and the number of occurrences of that area code (only looking at the first 3 characters of each line).

Example output (area code, count):
201, 44556
202, 34529
...

I am working in a Windows 10 environment.

After considerable research, I was able to use the switch statement in PowerShell to achieve something very close. The problem with this solution is that I need to know in advance which area codes I am looking for (and I don't know all the area codes listed in this file).

I would like to modify the solution so that it first finds all the unique area codes and then counts them.

Here's what I have tried:


  1. Say, I want to search for the following four area codes: 201,202,203,205
  2. My text file is datafile.txt
$count1 = 0
$count2 = 0
$count3 = 0
$count4 = 0
switch -File C:\datafile.txt -Regex { '201\S{8}' { $count1++ } }
Write-Output "Area Code 201: $count1" | Out-File "C:\summary.txt" -Append
switch -File C:\datafile.txt -Regex { '202\S{8}' { $count2++ } }
Write-Output "Area Code 202: $count2" | Out-File "C:\summary.txt" -Append
switch -File C:\datafile.txt -Regex { '203\S{8}' { $count3++ } }
Write-Output "Area Code 203: $count3" | Out-File "C:\summary.txt" -Append
switch -File C:\datafile.txt -Regex { '205\S{8}' { $count4++ } }
Write-Output "Area Code 205: $count4" | Out-File "C:\summary.txt" -Append

This code generates the file summary.txt and appends the count for each area code. However, I think this approach is inefficient because:

  1. I need to know all the area codes that are in this datafile.
  2. I have to add 3 lines of code for every additional area code.

I would appreciate any help improving this code, or an alternate solution. (I found a thread on Stack Overflow that uses grep, https://stackoverflow.com/questions/61229157/using-regex-in-grep-for-windows-command-line, but it has the same limitation: you need to know which string you are searching for.)

CodePudding user response:

Assuming I understood correctly, there is no regex needed here, just .Substring(0, 3) to get the first 3 characters of each line and a hashtable to ensure unique codes and efficiency.

Indeed, switch -File is great for this task and should be used to read your file. Alternatively, for simplicity while still keeping it efficient, you could use File.ReadLines (a sketch of that variant follows the code below).

$map = @{ }
switch -File path\to\source\file.txt {
    Default {
        # Count by the first 3 characters of each line
        $map[$_.Substring(0, 3)] += 1
    }
}

$map.GetEnumerator() | ForEach-Object {
    [pscustomobject]@{
        Code  = $_.Key
        Count = $_.Value
    }
} | Export-Csv path\to\resultOfUniqueCodes.csv -NoTypeInformation
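
For completeness, here is a minimal sketch of the File.ReadLines variant mentioned above; the file paths are placeholders to be replaced with your own:

$map = @{ }
foreach ($line in [System.IO.File]::ReadLines('C:\datafile.txt')) {
    # $null + 1 evaluates to 1, so unseen codes start counting at 1
    $map[$line.Substring(0, 3)] += 1
}

$map.GetEnumerator() | ForEach-Object {
    [pscustomobject]@{
        Code  = $_.Key
        Count = $_.Value
    }
} | Export-Csv C:\resultOfUniqueCodes.csv -NoTypeInformation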

CodePudding user response:

Try the following ($input is an automatic variable in PowerShell, so the sample data is stored in $csvData instead):

$input = @"
area,number
201,44556
202,34529
201,44556
202,34529
201,44556
202,34529
201,44556
202,34529
"@

$table = $csvData | ConvertFrom-Csv
$table | Format-Table

$groups = $table | Group-Object {$_.area}

$outputTable = [System.Collections.ArrayList]::new()
foreach ($group in $groups)
{
    $group | Format-Table

    $newRow = New-Object -TypeName psobject
    $newRow | Add-Member -NotePropertyName area -NotePropertyValue $group.Name
    $newRow | Add-Member -NotePropertyName count -NotePropertyValue $group.Count

    $outputTable.Add($newRow) | Out-Null
}
$outputTable | Format-Table
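
Since the question asks for an output file, the finished table can also be written out with Export-Csv (the output path here is just a placeholder):

$outputTable | Export-Csv C:\summary.csv -NoTypeInformation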

CodePudding user response:

Just my two cents for a streaming approach - trying to avoid everything that's comparatively slow, like ForEach-Object, pscustomobject and Export-Csv.

# Create a scriptblock to be able to pipe output of foreach loop
& { 
    foreach( $line in [IO.File]::ReadLines( 'input.txt' ) ) { 
        $line.Substring( 0, 3 )
    }
} | Group-Object -NoElement | & {
    begin {
        'Code,Count'
    }
    process {
        '{0},{1}' -f $_.Name, $_.Count
    }
} | Set-Content output.csv

Remarks:

  • foreach( $line in [IO.File]::ReadLines( 'input.txt' ) ) processes the input file lazily, so it's not read into memory as a whole. This works because ReadLines returns an iterator (not a collection), which foreach understands. As mentioned by others, ReadLines is considered one of the fastest ways to process a text file line by line, while still providing ease of use (compared to using .NET streams, for instance).
  • Group-Object -NoElement just counts the number of occurrences of unique input elements, probably using an internal hashtable, so it should be about as fast as a manually created hashtable (not measured though; see the benchmark sketch after this list).
  • Piping from Group-Object to a script block is much faster than ForEach-Object with a script block, see GitHub issue. Though in your case it doesn't matter much; the bottleneck will be reading and processing the input file.
  • As the kind of input data is known, we can avoid Export-Csv's complexities (like escaping rules) and create the CSV directly using simple string operations and Set-Content. Again, won't really make a difference here, but might still be good to know for other cases that are more output-heavy.
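
Regarding the "not measured" remark above, here is a rough benchmark sketch using Measure-Command to compare the manual hashtable against Group-Object -NoElement; the input path is a placeholder and the timings will of course depend on your data:

$file = 'input.txt'   # placeholder path

# Manual hashtable tally
$tHash = Measure-Command {
    $map = @{ }
    foreach ($line in [IO.File]::ReadLines($file)) {
        $map[$line.Substring(0, 3)] += 1
    }
}

# Group-Object -NoElement tally
$tGroup = Measure-Command {
    & {
        foreach ($line in [IO.File]::ReadLines($file)) {
            $line.Substring(0, 3)
        }
    } | Group-Object -NoElement | Out-Null
}

'Hashtable:    {0:n1} s' -f $tHash.TotalSeconds
'Group-Object: {0:n1} s' -f $tGroup.TotalSeconds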