I have a text file with over 250 million lines. Each line has a 3-digit area code followed by a comma and a 7-digit number.
Sample Input File:
201,2220000
201,5551212
310,5552481
376,1239876
443,0002222
572,8880099
...
I would like to generate an output file which lists each unique area code and the number of occurrences of that area code (only looking at the first 3 characters of each line).
Example output (area code, count):
201, 44556
202, 34529
...
I am working in a Windows 10 environment.
After considerable research, I was able to use the Switch function with regex in PowerShell to achieve something very close. The problem with this solution is that I need to know which area codes I am looking for (and I don't know all the area codes listed in this file).
I would like to modify the solution so that it first finds all unique area codes and then counts them.
Here's what I have tried:
- Say, I want to search for the following four area codes: 201,202,203,205
- My text file is datafile.txt
$count1 = 0
$count2 = 0
$count3 = 0
$count4 = 0
switch -File C:\datafile.txt -Regex { '201\S{8}' { $count1++ } }
Write-Output "Area Code 201: $($count1)" | Format-Table | Out-File "C:\summary.txt" -Append
switch -File C:\datafile.txt -Regex { '202\S{8}' { $count2++ } }
Write-Output "Area Code 202: $($count2)" | Format-Table | Out-File "C:\summary.txt" -Append
switch -File C:\datafile.txt -Regex { '203\S{8}' { $count3++ } }
Write-Output "Area Code 203: $($count3)" | Format-Table | Out-File "C:\summary.txt" -Append
switch -File C:\datafile.txt -Regex { '205\S{8}' { $count4++ } }
Write-Output "Area Code 205: $($count4)" | Format-Table | Out-File "C:\summary.txt" -Append
This code generates the file summary.txt and appends the counts to the area codes. However, I think this is inefficient as:
- I need to know all the area codes that are in this datafile.
- I have to add 3 lines of code for every additional area code.
Would appreciate any help improving this code, or suggestions for an alternate solution. (I found a thread on Stack Overflow that uses grep - https://stackoverflow.com/questions/61229157/using-regex-in-grep-for-windows-command-line - but it has the same limitation: you need to know what string you are searching for.)
CodePudding user response:
Assuming I understood correctly, there is no regex needed here, just .Substring(0, 3) to get the first 3 characters of each line and a hashtable to track the unique codes and their counts efficiently. Indeed, switch -File is awesome for this task and should be used to read your file. Otherwise, for simplicity while staying efficient, you could use File.ReadLines.
$map = @{ }
switch -File path\to\source\file.txt {
    Default {
        # increment the running count for this line's 3-character area code
        $map[$_.Substring(0, 3)]++
    }
}
$map.GetEnumerator() | ForEach-Object {
    [pscustomobject]@{
        Code  = $_.Key
        Count = $_.Value
    }
} | Export-Csv path\to\resultOfUniqueCodes.csv -NoTypeInformation
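For reference, a minimal sketch of the same hashtable counting loop built on File.ReadLines instead of switch -File (the paths are taken from the question; adjust them to your setup):
$map = @{ }
foreach ( $line in [System.IO.File]::ReadLines( 'C:\datafile.txt' ) ) {
    # ReadLines streams the file lazily, so even 250M lines are never held in memory at once
    $map[ $line.Substring( 0, 3 ) ]++
}
# emit simple "code,count" lines; sorting by area code is optional
$map.GetEnumerator() |
    Sort-Object Key |
    ForEach-Object { '{0},{1}' -f $_.Key, $_.Value } |
    Set-Content 'C:\summary.csv'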
CodePudding user response:
Try the following:
$input = @"
area,number
201,44556
202,34529
201,44556
202,34529
201,44556
202,34529
201,44556
202,34529
"@
$table = $input | ConvertFrom-Csv
$table | Format-Table
$groups = $table | Group-Object {$_.area}
$outputTable = [System.Collections.ArrayList]::new()
foreach($group in $groups)
{
$group | Format-Table
$newRow = New-Object -TypeName psobject
$newRow | Add-Member -NotePropertyName area -NotePropertyValue $group.Name
$newRow | Add-Member -NotePropertyName count -NotePropertyValue $group.Count
$outputTable.Add($newRow) | Out-Null
}
$outputTable | Format-Table
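For what it's worth, the same summary table can also be built in a single pipeline with calculated properties instead of the per-row Add-Member calls - a small sketch reusing the $table from above:
$table | Group-Object area |
    Select-Object @{ Name = 'area'; Expression = { $_.Name } },
                  @{ Name = 'count'; Expression = { $_.Count } } |
    Format-Table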
CodePudding user response:
Just my two cents for a streaming approach - trying to avoid everything that's comparatively slow, like ForEach-Object
, pscustomobject
and Export-Csv
.
# Create a scriptblock to be able to pipe output of foreach loop
& {
    foreach ( $line in [IO.File]::ReadLines( 'input.txt' ) ) {
        $line.Substring( 0, 3 )
    }
} | Group-Object -NoElement | & {
    begin {
        'Code,Count'
    }
    process {
        '{0},{1}' -f $_.Name, $_.Count
    }
} | Set-Content output.csv
Remarks:
- foreach( $line in [IO.File]::ReadLines( 'input.txt' ) ) processes the input file lazily, so it's not read into memory as a whole. This works because ReadLines returns an iterator (not a collection), which foreach understands. As mentioned by others, ReadLines is considered one of the fastest ways for line-by-line processing of text files, while still providing ease-of-use (compared to using .NET streams, for instance).
- Group-Object -NoElement just counts the number of occurrences of unique input elements, probably using an internal hashtable, so it should be about as fast as a manually created hashtable (not measured though - it would be really interesting; see the benchmark sketch after this list).
- Piping from Group-Object to a script block is much faster than ForEach-Object with a script block, see GitHub issue. Though in your case it doesn't matter much, as the bottleneck will be reading and processing the input file.
- As the kind of input data is known, we can avoid Export-Csv's complexities (like escaping rules) and create the CSV directly using simple string operations and Set-Content. Again, this won't really make a difference here, but might still be good to know for other cases that are more output-heavy.
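Picking up on that open question, here is a rough benchmark sketch comparing Group-Object -NoElement against a manual hashtable. The input path is an assumption - point $file at your real data - and absolute timings will of course vary by machine:
$file = 'input.txt'   # assumed input path - replace with your actual file

$viaGroupObject = Measure-Command {
    & { foreach ( $line in [IO.File]::ReadLines( $file ) ) { $line.Substring( 0, 3 ) } } |
        Group-Object -NoElement
}

$viaHashtable = Measure-Command {
    $map = @{ }
    foreach ( $line in [IO.File]::ReadLines( $file ) ) {
        $map[ $line.Substring( 0, 3 ) ]++
    }
}

'Group-Object: {0:n2} s' -f $viaGroupObject.TotalSeconds
'Hashtable:    {0:n2} s' -f $viaHashtable.TotalSeconds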