I am having trouble reading in a csv that contains a comma within a row value.
An example row including the data causing the issue (afaik) is as follows:
['true',47,'y','descriptive_evidence','n','true',66,[81,65]]
I think that the [81,65] entry is being scanned literally and thus treated as two entries [81 and 65]. Is there any way to override this in pandas, or do i have to manually replace the comma prior to reading into a dataframe?
From reading other answers, I am aware of the possibility of skipping rows using something like error_bad_lines=False, but in this case i cant afford to skip these entries.
Best Wishes :)
CodePudding user response:
You could try sep
with regex but it will be using python engine
and not c and it can be memory/time consuming. Here is the solution if you would like to go with this:
1,2,3,4,5,6,7,8
'true',47,'y','descriptive_evidence','n','true',66,[81,65]
pd.read_csv("./file_name.csv",sep=r",(?![^[]*\])",engine="python")
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | ------ | --- | --- | ---------------------- | --- | ------ | --- | ------- |
| 0 | 'true' | 47 | 'y' | 'descriptive_evidence' | 'n' | 'true' | 66 | [81,65] |
CodePudding user response:
This approach will standardize your file a bit, then load it in pandas.
a sample file:
['Bool','low_number','char','string','char2','bool','high_number','list_using_quotechar']
['true',47,'y','descriptive_evidence','n','true',66,[81,65]]
['true',47,'y','descriptive_evidence','n','true',66,[81,65]]
Code to standardize the file, and the load it:
import pandas as pd
with open('data_with_quote.csv') as original_file:
with open('data_fixed.csv', 'w') as new_file:
for line in original_file:
line = line.replace('\n','') # remove newline so all lines are equal
line = line[1:-1] # remove first and last charcter, '[' amd ']' respectively
line = line.replace('[','"') # replace '[' with a quote_character that will work with pandas
line = line.replace(']','"') # replace ']' with a quote_character that will work with pandas
new_file.write(line '\n')
your_data_as_df = pd.read_csv('data_fixed.csv',quotechar='"') # load file with quote_character from earlier