Home > Net >  How to read csv's into Pandas that use a quote separator but also non-escaped left and right do
How to read csv's into Pandas that use a quote separator but also non-escaped left and right do

Time:10-27

I have CSV files that need to be read in that have some lines that read fine but others that look like this:

2022-10-04, "", some data in col3, moredata, "data in quotes, like the size for this thing is 23'' x \28" with this description, moredata

2022-10-05, "", some data in col3, moredata, "data in quotes, like the size for this thing is 23“ x \28" with this description, moredata

so the problem I can't solve is: this is a CSV - so commas are the delimiter, and it uses double quote separators for values that have multiple commas in them that are not supposed to be read as delimiters, ok I figured out how to account for that in pandas read_csv options,

BUT, inside some of the quote separated fields, when there are numbers in inches, they use all 4 of:

escaped double quotes: \"

double single quotes: ''

AND left or right double quotes, like: that are not escaped, that I think might be getting misread as quote separator characters, and I am not sure how to ignore them.

I can't figure out how to get the CSV to read right in Pandas, or any other method. There are a lot of rows of data that use these left and right double quotes without escaping them, so if a row looks anything like:

something, "one value with 23'', 25\", 20“, ...", val 3, val_4

It has 4 values,

and the value "one value with 23'', 25\", 20“, end value" should be read in as 1 value: value with 23'', 25\", 20“, end value

But all the options I Have tried either end up skipping these rows or getting them read into the wrong columns, or just give errors and reading the data into a dataframe fails

Edit: per request from BeRT2me, here is a better example of a row from one of the CSV with 'actual' data. (I cannot provide any of the 'actual' values so put in fake data in the same format)

headers: start_date,end_date,product_code,available,category_rank,brand,name,category,price

row of data in the csv:

2022-10-05,2022-10-10,3716372837,1.0,"",brand1,"Puzzle map of the world, 300 pieces, 23” x 15\", great for all ages",Games,39.99

CodePudding user response:

Given test.txt:

start_date, end_date, product_code, available, category_rank, brand,name,category,price
2022-10-05,2022-10-10,3716372837,1.0,"",brand1,"Puzzle, 300'' p, 23” x 15\", great",Games,39.99

Doing:

df = pd.read_csv('test.txt', escapechar='\\')
print(df)

Output:

   start_date    end_date   product_code   available   category_rank   brand                               name category  price
0  2022-10-05  2022-10-10     3716372837         1.0             NaN  brand1  Puzzle, 300'' p, 23” x 15", great    Games  39.99
  • Related