I have a set of daily CSV files received from a 3rd party data source in which fields are not double-quoted unless they contain a comma, HOWEVER someone decided it was a good idea to include double quotes within a field sometimes.
The file contains records like this on occasion, which fails Bulk Inserting into MS SQL on the first row:
4458,123456,Danny "LEE" Smith,123456789122,Pizza Inc
4458,789012,John Baker,987654321,"Company Name, LLC"
I need to update the file programmatically to something like this so that SQL's Bulk Insert doesn't fail with row 2's double quotes remaining:
4458,123456,Danny 'LEE' Smith,123456789122,Pizza Inc
4458,789012,John Baker,987654321,"Company Name, LLC"
Is there a programmatic way with say Regex to make the change for the whole file, or will I just have to loop through each line looking for double quotes not preceded or post-ceded by commas?
CodePudding user response:
I used this site https://regex101.com/
To test a regex
for this file:
4458,123456,Danny "LEE" Smith,123456789122,Pizza Inc
4458,789012,John Baker,987654321,"Company Name, LLC"
4458,789012,"John,Baker",987654321,Company Name LLC
With this regex:
[^,](")[^[$|,|\n]
I managed to pick out just the quotes around "LEE"
So you can use any regex tool to find and replace those.
Since you mentioned C#, here's a code sample I adapted from here https://www.c-sharpcorner.com/article/c-sharp-regex-examples/
string badString = "<your CSV file>" ;
string regex = "[^,](\")[^[$|,|\n]";
string CleanedString = Regex.ReplaceGroupValue(badString, regex ,1,"'");
I also had to look up escaping strings in C#. This is untested but give it a try.
CodePudding user response:
Maybe something like this:
private static void NormalizeCsv(string inputFile, string outputFile, Encoding encoding) {
IEnumerable<string> inputLines = File.ReadLines(inputFile, encoding);
using TextWriter outputLines = new StreamWriter(outputFile, append: false, encoding);
outputLines.NewLine = "\r\n";
foreach (string line in inputLines) {
if (line.IndexOf('"') < 0) {
outputLines.WriteLine(line);
} else {
string[] tokens = line.Split('"', Int32.MaxValue, StringSplitOptions.None);
for (int i = 0; i < tokens.Length; i ) {
string token = tokens[i];
if ((i % 2) == 0) {
outputLines.Write(token);
} else {
if (token.IndexOf(',') > -1) {
outputLines.Write("\"");
outputLines.Write(token);
outputLines.Write("\"");
} else {
outputLines.Write("'");
outputLines.Write(token);
outputLines.Write("'");
}
}
}
outputLines.WriteLine();
}
}
}