I've got a CSV file in which the first element of each row sometimes contains an additional comma as part of the name. Since not all elements are wrapped in quotation marks, this results in a messed-up dataframe. I want to treat only the last n commas as separators; any additional commas at the beginning should be deleted.
Example:
"name", "rank", "wealth"
Donald Trump, Jr, "45", "3"
Barack Obama, 44, 0
In this example, that means dropping the first comma, but only in rows with more than 2 commas.
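For illustration, here is roughly what happens with a naive read of the sample above (a minimal sketch; the file name csv.csv is just an assumption):

import pandas as pd

# The "Donald Trump, Jr" row has 4 fields while the header declares 3,
# so with default settings pandas cannot tokenize the file and raises
# a ParserError instead of building the dataframe.
try:
    df = pd.read_csv('csv.csv')
except pd.errors.ParserError as err:
    print(err)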
CodePudding user response:
Based on the sample data shown, you could do the following. However, if your input file is more badly messed up than this (i.e., your sample is not truly representative), then this may not work.
Note: the input file for this code is a copy/paste of the sample given in the question.
import pandas as pd
import io

MAXCOMMAS = 2

iob = io.StringIO()

with open('csv.csv') as data_in:
    for line in data_in:
        # If a line has more than MAXCOMMAS commas, strip the surplus
        # commas from the left so only the real separators remain.
        if (commas := line.count(',')) > MAXCOMMAS:
            line = line.replace(',', '', commas - MAXCOMMAS)
        # Drop the quotation marks and buffer the cleaned line.
        iob.write(line.replace('"', ''))

iob.seek(0)
df = pd.read_csv(iob)
print(df)
Output:
              name  rank  wealth
0  Donald Trump Jr    45       3
1     Barack Obama    44       0
CodePudding user response:
You could scrub the file with the csv module before using pandas. The rule here is that if a row has more than 3 columns, all of the extra columns on the left (for left-to-right readers) are collapsed into a single comma-separated cell that is properly quoted by the csv module on write. The csv module is used so that anything that is already properly quoted or escaped isn't messed up in the process.
with open("test.csv", newline="") as infile, open("test-fixed.csv", "w", newline="") as outfile:
reader = csv.reader(infile)
writer = csv.writer(outfile)
for row in reader:
if len(row) > 21:
n = len(row) - 21 1
row[:n] = [",".join(row[:n])]
print(n, len(row))
writer.writerow(row)
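After scrubbing, the cleaned file can be read normally. A minimal sketch of that step, assuming the output file name test-fixed.csv from above:

import pandas as pd

# The scrubbed file has a consistent 3-column layout; the merged name
# cell survives intact because csv.writer quotes any cell containing a comma.
df = pd.read_csv("test-fixed.csv")
print(df)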