Read CSV file with quotechar-comma combination in string

I have got multiple csv files which look like this:

ID,Text,Value
1,"I play football",10
2,"I am hungry",12
3,"Unfortunately",I get an error",15

I am currently importing the data using the pandas read_csv() function.

df = pd.read_csv(filename, sep = ',', quotechar='"')

This works for the first two rows in my csv file, unfortunately I get an error in row 3. The reason is that within the 'Text' column there is a quotechar character-comma combination before the end of the column.

ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 4

Is there a way to solve this issue?

Expected output:

ID  Text                            Value
1   I play football                 10
2   I am hungry                     12
3   Unfortunately, I get an error   15

CodePudding user response：

You can try to fix the CSV using re module:

import re
import pandas as pd
from io import StringIO

with open("your_file.csv", "r") as f_in:
    s = re.sub(
        r'"(.*)"',
        lambda g: '"'   g.group(1).replace('"', "\\")   '"',
        f_in.read(),
    )

df = pd.read_csv(StringIO(s), sep=r",", quotechar='"', escapechar="\\")
print(df)

Prints:

   ID                          Text  Value
0   1               I play football     10
1   2                   I am hungry     12
2   3  Unfortunately,I get an error     15

CodePudding user response：

A possible solution:

df = pd.read_csv(filename, sep='(?<=\d),|,(?=\d)', engine='python')
df = df.reset_index().set_axis(['ID', 'Text', 'Value'], axis=1)
df['Text'] = df['Text'].replace('\"', '', regex=True)

Output:

   ID                          Text  Value
0   1               I play football     10
1   2                   I am hungry     12
2   3  Unfortunately,I get an error     15

CodePudding user response：

One (not so flexible) approach would be to firstly remove all " quotes from the csv, and then enclose the elements of the specific column with "" quotes(this is done to avoid misinterpreting the "," seperator while parsing), like this:

import csv

# Specify the column index (0-based)
column_index = 1

# Open the input CSV file
with open('input.csv', 'r') as f:
    reader = csv.reader(f)

    # Open the output CSV file
    with open('output.csv', 'w', newline='') as g:
        writer = csv.writer(g)

        # Iterate through the rows of the input CSV file
        for row in reader:
            # Replace the " character with an empty string
            row[column_index] = row[column_index].replace('"', '')
            # Enclose the modified element in "" quotes
            row[column_index] = f'"{row[column_index]}"'
            # Write the modified row to the output CSV file
            writer.writerow(row)

This code creates a new modified csv file

Then your problematic csv row will look like that: 3,"Unfortunately,I get an error",15"

Then you can import the data like you did: df = pd.read_csv(filename, sep = ',', quotechar='"')

To automate this conversion for all csv files within a directory:

import csv
import glob

# Specify the column index (0-based)
column_index = 1

# Get a list of all CSV files in the current directory
csv_files = glob.glob('*.csv')

# Iterate through the CSV files
for csv_file in csv_files:
    # Open the input CSV file
    with open(csv_file, 'r') as f:
        reader = csv.reader(f)

        # Open the output CSV file
        output_file = csv_file.replace('.csv', '_new.csv')
        with open(output_file, 'w', newline='') as g:
            writer = csv.writer(g)

            # Iterate through the rows of the input CSV file
            for row in reader:
                # Replace the " character with an empty string
                row[column_index] = row[column_index].replace('"', '')
                # Enclose the modified element in "" quotes
                row[column_index] = f'"{row[column_index]}"'
                # Write the modified row to the output CSV file
                writer.writerow(row)

this names the new csv files as the old ones but with "_new.csv" instead of just ".csv".