I have been having issues where my data is in the wrong fields. I have a few large csv files that I have to manually update before loading into QLIK. The csv's have a comma (,) delimiter & double quote (") text qualifier. Some data has extra characters that throw it off and results in numeric numbers in text fields and vice versa. Can someone please advise the best/fastest way to combat this? To remove the unwanted " and save me from manually deleting quotes and pasting to correct fields for hundreds of records. I have created dummy data below.
Please note I am bit limited with the tools I have available to clean the csv. Or could you please advise the best tools/applications needed for this? Just unsure where to start
IN NOTEPAD:
ID,T_No,T_Type,T_Date,T_Name,T_TNo,
2,256,House,30/05/2021,Airport,75.1,
3,268,Hotel,31/05/2021,Hotel Antel""",76.1
4,269,House,31/05/2021,Bank of USA,"LA Branch""""",77.1
IN EXCEL: [enter image description here][1]
Any assistance is greatly appreciated.
Thank you [1]: https://i.stack.imgur.com/vyYAT.png
CodePudding user response:
If the issue is just with the T_Name
column, you could set the mode to CsvMode.NoEscape
, use the ClassMap
to get the fields you know you can get without issue and then use some logic to figure out where the T_Name
column ends and the T_TNo
column starts. There is a lot that could break in this code, depending on what the rest of the data looks like, but it should at least give you some ideas.
void Main()
{
var text = new StringBuilder();
text.AppendLine("ID,T_No,T_Type,T_Date,T_Name,T_TNo,");
text.AppendLine("2,256,House,30/05/2021,Airport,75.1,");
text.AppendLine("3,268,Hotel,31/05/2021,Hotel Antel\"\"\",76.1");
text.AppendLine("4,269,House,31/05/2021,Bank of USA,\"LA Branch\"\"\"\"\",77.1");
var config = new CsvConfiguration(CultureInfo.InvariantCulture)
{
Mode = CsvMode.NoEscape
};
using (var reader = new StringReader(text.ToString()))
using (var csv = new CsvReader(reader, config))
{
var options = new TypeConverterOptions { Formats = new[] { "dd/MM/yyyy" } };
csv.Context.TypeConverterOptionsCache.AddOptions<DateTime>(options);
csv.Context.RegisterClassMap<MyClassMap>();
var records = new List<MyClass>();
csv.Read();
csv.ReadHeader();
while (csv.Read())
{
var record = csv.GetRecord<MyClass>();
var name = string.Empty;
int i = 4;
var finished = false;
while (!finished)
{
var field = csv.GetField(i);
if (i == 4)
{
record.Name = field.Replace("\"", "");
i ;
continue;
}
var isNumber = float.TryParse(field, out var number);
if (!isNumber)
{
record.Name = ", " field.Replace("\"", "");
i ;
continue;
}
record.TNumber = number;
finished = true;
}
records.Add(record);
}
records.Dump();
}
}
public class MyClassMap : ClassMap<MyClass>
{
public MyClassMap()
{
Map(x => x.Id).Name("ID");
Map(x => x.Number).Name("T_No");
Map(x => x.Type).Name("T_Type");
Map(x => x.Date).Name("T_Date");
}
}
public class MyClass
{
public int Id { get; set; }
public int Number { get; set; }
public string Type { get; set; }
public DateTime Date { get; set; }
public string Name { get; set; }
public float TNumber { get; set; }
}
CodePudding user response:
If you are trying to import/read the csv file into Microsoft Excl sheet.
Suggesting to use text editor to replace all ,
with |
in csv file.
Use Excl sheet import utility. Use |
as delimiter and test the result.
In many cases Microsoft Excel is smart to figure out strings from numerals and dates. You will still have the extra "
in your sheet. And you might want remove them all.