I have CSV files that need to be read in that have some lines that read fine but others that look like this:
2022-10-04, "", some data in col3, moredata, "data in quotes, like the size for this thing is 23'' x \28" with this description, moredata
2022-10-05, "", some data in col3, moredata, "data in quotes, like the size for this thing is 23“ x \28" with this description, moredata
so the problem I can't solve is: this is a CSV - so commas are the delimiter, and it uses double quote separators for values that have multiple commas in them that are not supposed to be read as delimiters, ok I figured out how to account for that in pandas read_csv options,
BUT, inside some of the quote separated fields, when there are numbers in inches, they use all 4 of:
escaped double quotes: \"
double single quotes: ''
AND left or right double quotes, like: “
that are not escaped, that I think might be getting misread as quote separator characters, and I am not sure how to ignore them.
I can't figure out how to get the CSV to read right in Pandas, or any other method. There are a lot of rows of data that use these left and right double quotes without escaping them, so if a row looks anything like:
something, "one value with 23'', 25\", 20“, ...", val 3, val_4
It has 4 values,
and the value "one value with 23'', 25\", 20“, end value"
should be read in as 1 value: value with 23'', 25\", 20“, end value
But all the options I Have tried either end up skipping these rows or getting them read into the wrong columns, or just give errors and reading the data into a dataframe fails
Edit: per request from BeRT2me, here is a better example of a row from one of the CSV with 'actual' data. (I cannot provide any of the 'actual' values so put in fake data in the same format)
headers:
start_date,end_date,product_code,available,category_rank,brand,name,category,price
row of data in the csv:
2022-10-05,2022-10-10,3716372837,1.0,"",brand1,"Puzzle map of the world, 300 pieces, 23” x 15\", great for all ages",Games,39.99
CodePudding user response:
Given test.txt
:
start_date, end_date, product_code, available, category_rank, brand,name,category,price
2022-10-05,2022-10-10,3716372837,1.0,"",brand1,"Puzzle, 300'' p, 23” x 15\", great",Games,39.99
Doing:
df = pd.read_csv('test.txt', escapechar='\\')
print(df)
Output:
start_date end_date product_code available category_rank brand name category price
0 2022-10-05 2022-10-10 3716372837 1.0 NaN brand1 Puzzle, 300'' p, 23” x 15", great Games 39.99