Using Powershell to remove illegal CRLF from csv row


Gentle Reader,

I have a year's worth of vendor CSV files sitting in a directory. My task is to load them into a SQL Server DB as a "Historical Load". The files are malformed, and while we are working with the vendor to re-send 365 new, properly structured files, I have been tasked with trying to work with what we have.

I'm restricted to using either C# (as a script task in SSIS) or Powershell.

Each file has no header but the schema is known and built into the SSIS package connection.

Each file has approx 35k rows, with a few dozen malformed rows per file.

Each properly formed row consists of 122 columns, i.e. 121 commas.

Rows are NOT text qualified.

Example: (data cleaned of PII)

555222,555222333444,1,HN71232,1/19/2018 8:58:07 AM,3437,27.50,HECTOR EVERYMAN,25-Foot Garden Hose - ,1/03/2018 10:17:24 AM,,1835,,,,,online,,MERCH,1,MERCH,MI,,,,,,,,,,,,,,,,,,,,6611060033556677,2526677,,,,,,,,,,,,,,EVERYMAN,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,,,,,555666NA118855,2/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,2121,,,1/29/2018 9:50:56 AM,0,,,[CRLF]
555222,555222444888,1,CASUAL50,1/09/2018 12:00:00 PM,7000,50.00,JANE SMITH,$50 Casual Gift Card,1/19/2018 8:09:15 AM,1/29/2018 8:19:25 AM,1856,,,,,online,,FREE,1,CERT,GC,,,,,,,6611060033553311[CRLF]
,6611060033553311[CRLF]
,,,,,,,,,25,,,6611060033556677,2556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,CASUAL25,VWDEB,,,,,,,555222NA118065,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/19/2018 12:00:15 PM,0,,,[CRLF]
555222,555222777666,1,CASHCS,1/12/2018 10:31:43 AM,2500,25.00,BOB SMITH,BIG BANK Rewards Cash Back Credit [...6S66],,,1821,,,,,online,,CHECK,1,CHECK,CK,,,,,,,,,,,,,,,,,,,,555222166446,5556677,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,VWDEB,,,1/23/2018 10:30:21 AM,,,,555666NA118844,1/22/2018 12:00:00 AM,,,,,,,,,,,,,,,,,,,,,,,,1/22/2018 10:31:26 AM,0,,,[CRLF]

PowerShell's Get-Content (I think...) reads the entire file into an array where each row is identified by the CRLF terminator. This means (again, I think) that malformed rows will be treated as elements of the array without respect to how many "columns" each holds.
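
For what it's worth, I can check that assumption from the prompt (a throwaway probe; the file name is illustrative and 121 is the comma count from above):

$rows = @(Get-Content .\OldFile.csv)      # one array element per CRLF-terminated line
"Physical lines: $($rows.Count)"
# fragments of broken rows surface as elements whose comma count is not 121
$suspect = $rows | Where-Object { ($_.Split(',').Count - 1) -ne 121 }
"Suspect lines: $($suspect.Count)"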

A C# StreamReader also splits on CRLF, but a StreamReader object has a few additional methods available, like Peek and Read, that may be useful.
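
And since PowerShell can reach the same .NET StreamReader directly, Peek and Read wouldn't be locked behind a C# script task (a minimal sketch on my part; the path is illustrative):

$reader = [System.IO.StreamReader]::new("$PWD\OldFile.csv")
while ($reader.Peek() -ge 0) {            # Peek returns the next character code without consuming it
    $line = $reader.ReadLine()
    # ... inspect $line here before deciding how to stitch rows together ...
}
$reader.Dispose()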

Please, Oh Wise Ones, point me in the direction of least resistance: a PowerShell script to process these malformed CSV files such that CRLFs that are not true end-of-line markers are removed.

Thank you.

CodePudding user response:

Based on @vonPryz's design, but in (Native¹) PowerShell:

$Delimiters = 121                          # a complete row holds 121 commas
Get-Content .\OldFile.csv |ForEach-Object { $Line = '' } {
    # accumulate physical lines until the buffer holds a complete logical row
    if ($Line) { $Line += ',' + $_ } else { $Line = $_ }
    $TotalMatches = ($Line |Select-String ',' -AllMatches).Matches.Count
    if ($TotalMatches -ge $Delimiters) {
        $Line                              # emit the completed row
        $Line = ''                         # reset the buffer for the next row
    }
} |Set-Content .\NewFile.Csv
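
One assumption to be aware of in this sketch: joining the buffered fragment and the next physical line with ',' treats the stray CRLF as if it had replaced a delimiter. If, as in the sample rows above, a continuation line already starts with its own comma, concatenate without the extra ',' instead, or the reassembled row gains an empty column.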

1) I guess performance might be improved by avoiding += string concatenation and using .Net methods along with text streams
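
As a rough illustration of that footnote (my sketch, not part of the answer above; the paths, the StringBuilder buffer, and the running comma count are all assumptions), the same accumulate-and-count loop over .NET text streams might look like:

$Delimiters = 121
$reader = [System.IO.StreamReader]::new("$PWD\OldFile.csv")
$writer = [System.IO.StreamWriter]::new("$PWD\NewFile.csv")
$buffer = [System.Text.StringBuilder]::new()
$commas = 0                                   # running count, so the buffer is never re-scanned
while ($null -ne ($line = $reader.ReadLine())) {
    if ($buffer.Length) { [void]$buffer.Append(',').Append($line); $commas++ }
    else                { [void]$buffer.Append($line) }
    $commas += $line.Split(',').Count - 1     # commas contributed by this physical line
    if ($commas -ge $Delimiters) {
        $writer.WriteLine($buffer.ToString())
        [void]$buffer.Clear()
        $commas = 0
    }
}
$reader.Dispose(); $writer.Dispose()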

CodePudding user response:

Honestly, your best bet is to get good data from the supplier. Trying to work around a mess will just cause problems later on. Garbage in, garbage out. And since it's you who wrote the garbage data into the database, congratulations, it's now your fault that the DB data is of poor quality. Please talk with your manager and the stakeholders first, so that you have an agreement in writing that you didn't break the data, it was broken to start with. I've seen such problems in ETL processing all too often.

Quick and dirty pseudocode, without error handling, edge-case processing, substring index checks, performance guarantees and whatnot, goes like so:

while(dataInFile)
  line = readline()

  :parseLine

  commasInLine = countCommas(line)

  if commasInLine == rightAmount
    addLineInOKBuffer(line)
  else
    commasNeeded = rightAmount - commasInLine

    if commasNeeded < 0
      # too many commas, two rows are combined
      lastCommaLocation = getLastCommaIndex(line, commasNeeded)
      addLineInOKBuffer(line.substring(0, lastCommaLocation))
      line = line.substring(lastCommaLocation, line.end)
      goto :parseLine
    else
      # too few commas, need to read the next line too
      line = line.removeCrLf() + readline()
      goto :parseLine

The idea is that you first read a line and count how many commas it contains. If the count matches what's expected, the row is not broken. Store it in a buffer containing good data.

If you have too many commas, the line contains at least two different rows. Find the index where the first row ends, extract it and store it in the good-data buffer. Then remove the already-processed part of the line and start again.

If you have too few commas, the row has been split by a newline. Read the next line from the file, join it with the current line, and start parsing again from counting the commas.
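
For completeness, a rough PowerShell rendering of the pseudocode above (my sketch, not part of this answer; the 121-comma count, the file names, and the split-based slicing are assumptions):

$rightAmount = 121                                    # commas in a complete row
$okBuffer = [System.Collections.Generic.List[string]]::new()
$lines = @(Get-Content .\OldFile.csv)
$i = 0; $line = ''
while ($i -lt $lines.Count -or $line) {
    if (-not $line) { $line = $lines[$i]; $i++ }
    $fields = $line.Split(',')
    $commasInLine = $fields.Count - 1
    if ($commasInLine -eq $rightAmount) {
        $okBuffer.Add($line)                          # row is intact: store it
        $line = ''
    }
    elseif ($commasInLine -gt $rightAmount) {
        # too many commas: two rows are fused; the first 122 fields are row one
        $okBuffer.Add(($fields[0..$rightAmount] -join ','))
        $line = $fields[($rightAmount + 1)..($fields.Count - 1)] -join ','
    }
    elseif ($i -lt $lines.Count) {
        # too few commas: the row was split by a stray CRLF; pull in the next line
        $line += $lines[$i]; $i++
    }
    else {
        $okBuffer.Add($line); $line = ''              # trailing fragment: keep it visible
    }
}
$okBuffer | Set-Content .\NewFile.csv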
