I've got a CSV file in which the first element of each row sometimes contains an additional comma as part of the name. Since not all elements are wrapped in quotation marks, this results in a messed-up dataframe. I want to treat only the last n commas as separators; any additional commas at the beginning should be deleted.
Example:
"name", "rank", "wealth"
Donald Trump, Jr, "45", "3"
Barack Obama, 44, 0
In this example, that means dropping the first comma, but only in rows with more than 2 commas.
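For illustration, here is roughly what happens with a naive read of the sample above (a minimal sketch; the file name csv.csv is just an assumption):

import pandas as pd

# The "Donald Trump, Jr" row has 4 fields while the header declares 3,
# so with default settings pandas cannot tokenize the file and raises
# a ParserError instead of building the dataframe.
try:
    df = pd.read_csv('csv.csv')
except pd.errors.ParserError as err:
    print(err)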
CodePudding user response:
Based on the sample data shown, you could do the following. However, if your input file is more badly messed up than this (i.e., your sample is not truly representative), then this may not work.
Note: the input file for this code is a copy/paste of the sample given in the question.
import pandas as pd
import io

MAXCOMMAS = 2

iob = io.StringIO()

with open('csv.csv') as data_in:
    for line in data_in:
        # If a line has more than MAXCOMMAS commas, strip the surplus
        # commas from the left so only the real separators remain.
        if (commas := line.count(',')) > MAXCOMMAS:
            line = line.replace(',', '', commas - MAXCOMMAS)
        # Drop the quotation marks and buffer the cleaned line.
        iob.write(line.replace('"', ''))

iob.seek(0)
df = pd.read_csv(iob)
print(df)
Output:
              name  rank  wealth
0  Donald Trump Jr    45       3
1     Barack Obama    44       0
CodePudding user response:
You could scrub the file with the csv module before using pandas. The rule here is that if a row has more than 3 columns, all of the extra columns on the left (for left-to-right readers) are collapsed into a single comma-separated cell that is properly quoted by the csv module on write. The csv module is used so that anything that is already properly quoted or escaped isn't messed up in the process.
with open("test.csv", newline="") as infile, open("test-fixed.csv", "w", newline="") as outfile:
reader = csv.reader(infile)
writer = csv.writer(outfile)
for row in reader:
if len(row) > 21:
n = len(row) - 21 1
row[:n] = [",".join(row[:n])]
print(n, len(row))
writer.writerow(row)
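After scrubbing, the cleaned file can be read normally. A minimal sketch of that step, assuming the output file name test-fixed.csv from above:

import pandas as pd

# The scrubbed file has a consistent 3-column layout; the merged name
# cell survives intact because csv.writer quotes any cell containing a comma.
df = pd.read_csv("test-fixed.csv")
print(df)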