What is the fastest way to retrieve the header names from csv files

Time:08-04

I am trying to organize the column names by retrieving the unique header names of the CSV files.

I used the following code to retrieve the header names, but the script is slow when there are large files, or millions of CSV files, in directories and subdirectories.

$files = Get-ChildItem "F:\MY_DATA\ASUSH" -Recurse
foreach ($f in $files) {
  if ($f -Like "*.csv") {
    echo $f.FullName
    $Data = Get-Content -Path $f.FullName
    echo $Data[0]
  }
}

What is the fastest way to retrieve the csv file header names?

CodePudding user response:

Get-Content has a -TotalCount parameter that will only read a certain number of lines.

$Data = Get-Content -Path $f.FullName -TotalCount 1

That should speed things up.
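
Dropped into the loop from the question, that could look roughly like the sketch below. The temporary folder and the two sample files are made up so that the snippet runs anywhere; point $root at your own folder instead. Using -Filter *.csv rather than the -Like test is an addition on my part: it lets the file system skip non-CSV files during enumeration.

```powershell
# Hypothetical sample data so the sketch is runnable as-is.
$root = Join-Path ([System.IO.Path]::GetTempPath()) "csv-header-demo-$([guid]::NewGuid())"
New-Item -ItemType Directory -Path (Join-Path $root 'sub') -Force | Out-Null
Set-Content -Path (Join-Path $root 'a.csv') -Value 'id,name', '1,Ann'
Set-Content -Path (Join-Path $root 'sub/b.csv') -Value 'id,price', '2,9.99'

# -File and -Filter restrict the enumeration to *.csv files up front.
$headers = foreach ($f in Get-ChildItem -Path $root -Filter *.csv -File -Recurse) {
  # -TotalCount 1 stops reading after the first line of each file.
  Get-Content -LiteralPath $f.FullName -TotalCount 1
}
$headers
```

$headers ends up holding one header line per CSV file found under $root.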

CodePudding user response:

Leaving aside that .NET APIs can also be used directly to speed up the enumeration of files, here's an efficient .NET API solution for reading the first line of each CSV file:

foreach ($f in Get-ChildItem F:\MY_DATA\ASUSH -Filter *.csv -Recurse) {
  $f.FullName
  # Read and output the first line of the file at hand.
  [Linq.Enumerable]::Take(
    [System.IO.File]::ReadLines($f.FullName),
    1
  )
}
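
As for the file enumeration mentioned at the start, one way to do it directly with .NET is [System.IO.Directory]::EnumerateFiles, sketched below. This is only an illustration, not something I have timed against Get-ChildItem here. The temporary folder and sample files are hypothetical stand-ins for your own directory.

```powershell
# Hypothetical sample data so the sketch is runnable as-is.
$root = Join-Path ([System.IO.Path]::GetTempPath()) "csv-enum-demo-$([guid]::NewGuid())"
New-Item -ItemType Directory -Path (Join-Path $root 'sub') -Force | Out-Null
Set-Content -Path (Join-Path $root 'a.csv') -Value 'id,name', '1,Ann'
Set-Content -Path (Join-Path $root 'sub/b.csv') -Value 'id,price', '2,9.99'

# EnumerateFiles lazily yields plain path strings instead of FileInfo objects.
# Pass a full path: .NET does not know about PowerShell's current location.
$paths = [System.IO.Directory]::EnumerateFiles(
  $root, '*.csv', [System.IO.SearchOption]::AllDirectories
)

$headers = foreach ($path in $paths) {
  # Same first-line technique as above; PowerShell enumerates the result on output.
  [Linq.Enumerable]::Take([System.IO.File]::ReadLines($path), 1)
}
$headers
```

One caveat: EnumerateFiles called this way stops with an exception when it hits a folder it is not allowed to read, whereas Get-ChildItem reports an error and carries on.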

Perhaps surprisingly, this is noticeably faster than the more concise, conceptually more direct Get-Content -TotalCount 1 solution in James Parr's helpful answer.

Even a hybrid approach,
[System.IO.File]::ReadLines($f.FullName) | Select-Object -First 1
performs better in my informal tests (but is slower than the cmdlet-less solution at the top).
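
If you want to repeat such an informal comparison on your own machine, a rough Measure-Command harness along these lines should do. The sample file, its row count, and the number of runs are arbitrary assumptions. Absolute numbers will vary with hardware, PowerShell version, and file system caching, so treat the output as a relative indication only.

```powershell
# Hypothetical sample file; row count and run count are arbitrary.
$file = Join-Path ([System.IO.Path]::GetTempPath()) "header-timing-$([guid]::NewGuid()).csv"
Set-Content -Path $file -Value (@('id,name') + (1..20000 | ForEach-Object { "$_,row$_" }))
$runs = 200

# The three approaches discussed above, each reading only the first line.
$candidates = [ordered]@{
  'Get-Content -TotalCount 1'          = { Get-Content -LiteralPath $file -TotalCount 1 }
  'ReadLines | Select-Object -First 1' = { [System.IO.File]::ReadLines($file) | Select-Object -First 1 }
  '[Linq.Enumerable]::Take'            = { [Linq.Enumerable]::Take([System.IO.File]::ReadLines($file), 1) }
}

$results = foreach ($name in $candidates.Keys) {
  $body = $candidates[$name]
  # @(...) forces each result to be enumerated, so the line really is read.
  $elapsed = Measure-Command { foreach ($i in 1..$runs) { $null = @(& $body) } }
  [pscustomobject]@{
    Approach     = $name
    Milliseconds = [math]::Round($elapsed.TotalMilliseconds, 1)
  }
}
$results | Sort-Object Milliseconds
```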

All these solutions benefit from reading the file's lines one by one, on demand. That is, processing stops once the first line has been read. Your approach, by contrast, is in essence (Get-Content -Path $f.FullName)[0], which reads all lines into an array first and only then extracts the first array element.
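
Since each file then costs only a single line read, the goal stated in the question (the unique header names across all files) can be built on top with a HashSet. The sketch below makes several assumptions that may not hold for your data: comma-delimited headers, no commas inside quoted column names, and case-insensitive matching of names. The folder and sample files are hypothetical.

```powershell
# Hypothetical sample data so the sketch is runnable as-is.
$root = Join-Path ([System.IO.Path]::GetTempPath()) "csv-unique-demo-$([guid]::NewGuid())"
New-Item -ItemType Directory -Path (Join-Path $root 'sub') -Force | Out-Null
Set-Content -Path (Join-Path $root 'a.csv') -Value 'id,name', '1,Ann'
Set-Content -Path (Join-Path $root 'sub/b.csv') -Value 'ID,"price"', '2,9.99'

# Case-insensitive set: 'id' and 'ID' count as the same column name.
$unique = [System.Collections.Generic.HashSet[string]]::new(
  [System.StringComparer]::OrdinalIgnoreCase
)

foreach ($f in Get-ChildItem -Path $root -Filter *.csv -File -Recurse) {
  # The loop body runs at most once per file; an empty file yields nothing.
  foreach ($line in [Linq.Enumerable]::Take([System.IO.File]::ReadLines($f.FullName), 1)) {
    # Naive split: assumes no commas inside quoted column names.
    foreach ($name in $line.Split(',')) {
      $null = $unique.Add($name.Trim().Trim('"'))
    }
  }
}
$unique | Sort-Object
```

For headers that contain quoted commas, a real CSV parser would be the safer choice for the split step.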

A Get-Content solution optimized with -TotalCount 1 (aka -First 1 aka -Head 1) is likely slower than an optimized .NET API solution because Get-Content decorates each output line with metadata, as discussed in the bottom section of this answer, which also contains general Get-Content performance tips.
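
You can see that decoration for yourself by listing the NoteProperty members attached to a line returned by Get-Content and comparing them with a line read through .NET. The sample file below is hypothetical. The sketch only shows that the extra members exist; it does not measure how much of the slowdown they account for.

```powershell
# Hypothetical sample file.
$file = Join-Path ([System.IO.Path]::GetTempPath()) "header-meta-$([guid]::NewGuid()).csv"
Set-Content -Path $file -Value 'id,name', '1,Ann'

# Line as returned by Get-Content: a string with provider metadata attached.
$line = Get-Content -LiteralPath $file -TotalCount 1
$decorated = @($line.PSObject.Properties |
  Where-Object MemberType -eq 'NoteProperty' |
  ForEach-Object Name)
$decorated   # e.g. PSPath, PSParentPath, PSChildName, PSDrive, PSProvider, ReadCount

# The same line read through .NET is a plain string without those members.
$plain = @([Linq.Enumerable]::Take([System.IO.File]::ReadLines($file), 1))[0]
$plainCount = @($plain.PSObject.Properties |
  Where-Object MemberType -eq 'NoteProperty').Count
$plainCount
```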
