Home > Blockchain >  CSV file fix for double quotes within fields that are not always double quoted while inserting into
CSV file fix for double quotes within fields that are not always double quoted while inserting into

Time:03-29

I have a set of daily CSV files received from a 3rd party data source in which fields are not double-quoted unless they contain a comma, HOWEVER someone decided it was a good idea to include double quotes within a field sometimes.

The file contains records like this on occasion, which fails Bulk Inserting into MS SQL on the first row:

4458,123456,Danny "LEE" Smith,123456789122,Pizza Inc
4458,789012,John Baker,987654321,"Company Name, LLC"

I need to update the file programmatically to something like this so that SQL's Bulk Insert doesn't fail with row 2's double quotes remaining:

4458,123456,Danny 'LEE' Smith,123456789122,Pizza Inc
4458,789012,John Baker,987654321,"Company Name, LLC"

Is there a programmatic way with say Regex to make the change for the whole file, or will I just have to loop through each line looking for double quotes not preceded or post-ceded by commas?

CodePudding user response:

I used this site https://regex101.com/

To test a regex

for this file:

4458,123456,Danny "LEE" Smith,123456789122,Pizza Inc
4458,789012,John Baker,987654321,"Company Name, LLC"
4458,789012,"John,Baker",987654321,Company Name LLC

With this regex:

[^,](")[^[$|,|\n]

I managed to pick out just the quotes around "LEE"

So you can use any regex tool to find and replace those.

Since you mentioned C#, here's a code sample I adapted from here https://www.c-sharpcorner.com/article/c-sharp-regex-examples/

string badString = "<your CSV file>" ;  
string regex = "[^,](\")[^[$|,|\n]";
string CleanedString = Regex.ReplaceGroupValue(badString, regex ,1,"'");  

I also had to look up escaping strings in C#. This is untested but give it a try.

CodePudding user response:

Maybe something like this:

private static void NormalizeCsv(string inputFile, string outputFile, Encoding encoding) {
    IEnumerable<string> inputLines = File.ReadLines(inputFile, encoding);
    using TextWriter outputLines = new StreamWriter(outputFile, append: false, encoding);
    outputLines.NewLine = "\r\n";
    foreach (string line in inputLines) {
        if (line.IndexOf('"') < 0) {
            outputLines.WriteLine(line);
        } else {
            string[] tokens = line.Split('"', Int32.MaxValue, StringSplitOptions.None);
            for (int i = 0; i < tokens.Length; i  ) {
                string token = tokens[i];
                if ((i % 2) == 0) {
                    outputLines.Write(token);
                } else {
                    if (token.IndexOf(',') > -1) {
                        outputLines.Write("\"");
                        outputLines.Write(token);
                        outputLines.Write("\"");
                    } else {
                        outputLines.Write("'");
                        outputLines.Write(token);
                        outputLines.Write("'");
                    }
                }
            }
            outputLines.WriteLine();
        }
    }
}
  • Related