Home > Net >  Pandas.read_csv() Decoding Error tokenizing data because of a comma in data
Pandas.read_csv() Decoding Error tokenizing data because of a comma in data

Time:06-30

I am having trouble reading in a csv that contains a comma within a row value.

An example row including the data causing the issue (afaik) is as follows:

['true',47,'y','descriptive_evidence','n','true',66,[81,65]]

I think that the [81,65] entry is being scanned literally and thus treated as two entries [81 and 65]. Is there any way to override this in pandas, or do i have to manually replace the comma prior to reading into a dataframe?

From reading other answers, I am aware of the possibility of skipping rows using something like error_bad_lines=False, but in this case i cant afford to skip these entries.

Best Wishes :)

CodePudding user response:

You could try sep with regex but it will be using python engine and not c and it can be memory/time consuming. Here is the solution if you would like to go with this:

1,2,3,4,5,6,7,8
'true',47,'y','descriptive_evidence','n','true',66,[81,65]
pd.read_csv("./file_name.csv",sep=r",(?![^[]*\])",engine="python")
|     | 1      | 2   | 3   | 4                      | 5   | 6      | 7   | 8       |
| --- | ------ | --- | --- | ---------------------- | --- | ------ | --- | ------- |
| 0   | 'true' | 47  | 'y' | 'descriptive_evidence' | 'n' | 'true' | 66  | [81,65] |

CodePudding user response:

This approach will standardize your file a bit, then load it in pandas.

a sample file:

['Bool','low_number','char','string','char2','bool','high_number','list_using_quotechar']
['true',47,'y','descriptive_evidence','n','true',66,[81,65]]
['true',47,'y','descriptive_evidence','n','true',66,[81,65]]

Code to standardize the file, and the load it:

import pandas as pd

with open('data_with_quote.csv') as original_file:
    with open('data_fixed.csv', 'w') as new_file:
        for line in original_file:
            line = line.replace('\n','') # remove newline so all lines are equal
            line = line[1:-1] # remove first and last charcter, '[' amd ']' respectively
            line = line.replace('[','"') # replace '[' with a quote_character that will work with pandas
            line = line.replace(']','"') # replace ']' with a quote_character that will work with pandas
            new_file.write(line   '\n')
            
your_data_as_df = pd.read_csv('data_fixed.csv',quotechar='"') # load file with quote_character from earlier
  • Related