I need to process some CSV data I get from an external provider I can't control.
The data has a fixed number of columns but a variable number of rows.
To summarize, it can look like this:
Version,Relative Path,Author,Date,Comment,Link
1,somePath,someperson,01/20/2020,"Some phrase with punctuation including comas, but in which case it seems to be automatically formatted with quotes...",some.url
3,somePath,someperson,01/20/2020,Simple comment with no punctuation,another.url
50,somePath,someperson,01/20/2020,Another comment,some.url
AFAIK the best way for me to do it is using this:
import csv

with open("sample.csv", "r", newline="") as csvFile:
    reader = csv.DictReader(csvFile)
    data = {}
    for row_dict in reader:
        for key, value in row_dict.items():
            # Append this row's value to the list collected for its column
            data[key] = data.get(key, []) + [value]
print(data)
And the result looks great! Except for the "Comment" column: since the comments contain punctuation and the CSV delimiter is a comma (,), the script splits these comments into separate items and messes up the dictionary.
I wish I could ask the provider to make a specific CSV output for me, but that's unlikely to happen.
HOWEVER: it looks like any comment that contains commas is wrapped in quotation marks ("). So I was wondering if I could somehow make the delimiter ignore whatever's between quotes?
Or if you have a better way to get the same result, I'm fine with that. I can only use vanilla Python 3.10.2, though, with no third-party modules like numpy (even though I'd love to). And if the solution is to edit the CSV's content, then it should preserve the commas inside the comments.
CodePudding user response:
Here is how I solve a similar situation...
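import csv
# csvfile is assumed to be an already-open file object, e.g. open("sample.csv", newline="")
delimiter = ','  # comma, as in the question's sample
quote_character = '"'
file_reader = csv.DictReader(csvfile, delimiter=delimiter, quotechar=quote_character,
                             restkey='unrecognized_cols')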
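Note the quotechar parameter: a delimiter that appears inside a field wrapped in that character is kept as part of the field instead of splitting it.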
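For what it's worth, here is a minimal self-contained sketch of that idea applied to the question's layout. The sample text is abridged from the question and fed through io.StringIO instead of a real file, and the helper name read_columns is just something I made up for illustration. As far as I know, the csv module already defaults to delimiter=',' and quotechar='"', so passing them explicitly mostly documents the intent.

```python
import csv
import io

# Abridged from the question's sample; io.StringIO stands in for the real file.
SAMPLE = '''Version,Relative Path,Author,Date,Comment,Link
1,somePath,someperson,01/20/2020,"Some phrase, with a comma",some.url
3,somePath,someperson,01/20/2020,Simple comment with no punctuation,another.url
'''


def read_columns(csvfile, delimiter=",", quotechar='"'):
    """Return a dict mapping each header to the list of values in that column.

    read_columns is a hypothetical helper, not part of the csv module.
    """
    reader = csv.DictReader(csvfile, delimiter=delimiter, quotechar=quotechar,
                            restkey="unrecognized_cols")
    columns = {}
    for row in reader:
        for key, value in row.items():
            # setdefault creates the list the first time a column is seen
            columns.setdefault(key, []).append(value)
    return columns


columns = read_columns(io.StringIO(SAMPLE))
print(columns["Comment"])
# ['Some phrase, with a comma', 'Simple comment with no punctuation']
```

With restkey set, any fields beyond the header count end up as a list under 'unrecognized_cols' rather than being lost, which makes a badly split row easy to spot. If quoted commas still get split on the real file, one thing that might be worth checking (a guess about the provider's output, nothing in the question confirms it) is whether there is a space between the comma and the opening quote; passing skipinitialspace=True to the reader covers that case.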