Home > Enterprise >  Easiest way to fix extra commas in one column of a csv file
Easiest way to fix extra commas in one column of a csv file

Time:08-24

I have a very large CSV file that looks like this:

rownum, id, first, last, age, ADDRESS, weight, hair, pet, food
1, 123450, John, Bingo, 47, 123 Odd St., Waverly Place Apts, PO Box 12345, Apt#5E, Upper-Ontario, Eastern Province A12-E765, Not Puerto Rico, US, 299, red, cat, lasagna
2, 125379, Joe, Durante, 61, 19345 S. 1st Ave., Seattle, WA, 16748, 180, blonde, dog, hotdogs
3, 197572, James, Gringo, 39, 123 Maypole St., Northside Castle, upper east side, NY, NY 30594, 202, brown, dog, lo-mein
4, 129358, Jim, Dingus, 22, 0985 Martyr Ave, Fancytown, MA 49436, USA, 163, brown, goldfish, hamburgers
5, 987543, Dwayne 'The Rock', Johnson, 42, 555 Fitness Ln, Los Angeles, CA, 90210, 260, black, dog, steak
6, 048573, Jean, Grey, 33, 987 X-Men Rd., Rm. 3F, outside boston?, MA 34972, 130, red, <null>, salad
7, 756432, Jose, Cuervo, 59, 444 Jalisco Rd., agave_town, Mexico, not, sure, what, their zipcode system is?, 145, black, dog, margaritas
8, 845384, Junebug, Messerschmit, 2, 22nd Ave N, Boston, MA 45678, 130, blonde, turtle, lollipops
9, 634839, Jimbo, Humboldt, 99, 111 1st Street Kansas City KS 84638, 220, brown, ferrets, tacos
10, 483629, Julius, Caesar, 30, Emperors Estate in Ancient Rome, 145, brown, servants, grapes

I'm having trouble parsing the ADDRESS column due to the extra commas. My desired output will look something like this:

rownum| id| first| last| age| ADDRESS| weight| hair| pet| food
1| 123450| John| Bingo| 47| 123 Odd St., Waverly Place Apts, PO Box 12345, Apt#5E, Upper-Ontario, Eastern Province A12-E765, Not Puerto Rico, US| 299| red| cat| lasagna
2| 125379| Joe| Durante| 61| 19345 S. 1st Ave., Seattle, WA, 16748 180| blonde| dog| hotdogs
3| 197572| James| Gringo| 39| 123 Maypole St., Northside Castle, upper east side, NY, NY 30594| 202| brown| dog| lo-mein
4| 129358| Jim| Dingus| 22| 0985 Martyr Ave, Fancytown, MA 49436, USA| 163| brown| goldfish| hamburgers
5| 987543| Dwayne 'The Rock'| Johnson| 42| 555 Fitness Ln, Los Angeles, CA, 90210| 260| black| dog| steak
6| 048573| Jean| Grey| 33| 987 X-Men Rd., Rm. 3F, outside boston?,. MA 34972| 130| red| <null>| salad
7| 756432| Jose| Cuervo| 59| 444 Jalisco Rd., agave_town, Mexico, not, sure, what| their zipcode system is?| 145| black| dog| margaritas
8| 845384| Junebug| Messerschmit| 2| 22nd Ave N, Boston, MA 45678| 130| blonde| turtle| lollipops
9| 634839| Jimbo| Humboldt| 99| 111 1st Street Kansas City KS 84638| 220| brown| ferrets| tacos
10| 483629| Julius| Caesar| 30| Emperors Estate in Ancient Rome| 145| brown| servants| grapes

It doesn't have to be pipe-delimited, I just need it in a format that Excel can read correctly. I couldn't do it in Excel with Text Import Wizard, but maybe I'm missing something? Am I overlooking a simple solution for this?

First, I thought I could use Notepad to simply do a regex find and replace (e.g., replace first "," with a "|", go to next line, repeat. Run that 5 times. Then do regex backwards from end of each line and run that 4 times. I was not able to get this to work.

Now I'm trying to do it with python re or pandas, but I'm not getting very far since I am still fairly new to python. I don't have much experience with python ETL, csv/text file read/write, regex iterating on each line, etc.

I'm sure there are different ways to solve this (e.g., add dbl quote escape character after 5th comma from start and before 4th comma from end).

Here is the mess I have so far in my jupyter notebook

## Idea01a: The user defines the delimiter character and proper qty per line

delimiter=input("Type delimiter example here, then press enter: ")
l=input("paste 1st line of csv here (e.g. column headers only), then press enter: ")
d={}
## print(l)
for i in l:
 if i not in d:
  d[i]=l.count(i)
 else:
  pass

qty_proper_delimiters_total = (d[delimiter])
print("Delimiter character chosen:")
print(delimiter)
print("Proper number of delimiters:")
print(qty_proper_delimiters_total)

## Idea01b: User defines problem column

bad_column=input("""Enter afflicted column number, then press enter: 
(e.g., A=1, B=2, C=3, D=4, E=5, F=6, G=7, etc.)""")
print(bad_column)
qty_proper_delimiters_before_bad_column = (int(bad_column)-1)
print("Proper qty commas BEFORE bad column:")
print(qty_proper_delimiters_before_bad_column)
qty_proper_delimiters_after_bad_column = (qty_proper_delimiters_total-(int(bad_column)-1))
print("Proper qty commas AFTER bad column:")
print(qty_proper_delimiters_after_bad_column)

## Idea02: Insert escape character just right/left of flanking commas 

## (iterate n1 times from line start, then n2 times backwards from line end)
n1 = qty_proper_delimiters_before_bad_column
n2 = qty_proper_delimiters_after_bad_column

txt = input("Copy/Paste entire CSV here:")
## need to figure out how to iterate line by line

Thanks for your help.

CodePudding user response:

With the new data, I am posting an answer based on the idea I've written in the comment section: you might be able to identify the address part by stripping off first and last (some fixed amount of) fields. So try the following, and check carefully whether it achieves what you want.

import csv

with open('foo.csv') as f:
    for record in csv.reader(f):
        print(*record[:5], ', '.join(record[5:-4]), *record[-4:], sep='|')

Output:

rownum| id| first| last| age| ADDRESS| weight| hair| pet| food
1| 123450| John| Bingo| 47| 123 Odd St.,  Waverly Place Apts,  PO Box 12345,  Apt#5E,  Upper-Ontario,  Eastern Province A12-E765,  Not Puerto Rico,  US| 29
9| red| cat| lasagna
2| 125379| Joe| Durante| 61| 19345 S. 1st Ave.,  Seattle,  WA,  16748| 180| blonde| dog| hotdogs
3| 197572| James| Gringo| 39| 123 Maypole St.,  Northside Castle,  upper east side,  NY,  NY 30594| 202| brown| dog| lo-mein
4| 129358| Jim| Dingus| 22| 0985 Martyr Ave,  Fancytown,  MA 49436,  USA| 163| brown| goldfish| hamburgers
5| 987543| Dwayne 'The Rock'| Johnson| 42| 555 Fitness Ln,  Los Angeles,  CA,  90210| 260| black| dog| steak
6| 048573| Jean| Grey| 33| 987 X-Men Rd.,  Rm. 3F,  outside boston?,  MA 34972| 130| red| <null>| salad
7| 756432| Jose| Cuervo| 59| 444 Jalisco Rd.,  agave_town,  Mexico,  not,  sure,  what,  their zipcode system is?| 145| black| dog| margaritas
8| 845384| Junebug| Messerschmit| 2| 22nd Ave N,  Boston,  MA 45678| 130| blonde| turtle| lollipops
9| 634839| Jimbo| Humboldt| 99| 111 1st Street Kansas City KS 84638| 220| brown| ferrets| tacos
10| 483629| Julius| Caesar| 30| Emperors Estate in Ancient Rome| 145| brown| servants| grapes

CodePudding user response:

I did this purely for the joy of the challenge. I do not know if it is efficient or not. If it helps great - otherwise, j1-lee's answer will probably work out well.

import os

if os.path.exists("new_csv.csv"):
    os.remove("new_csv.csv")

with open("my_csv.csv", "r") as f:
    contents = f.readlines()

headers = [e.strip().lower() for e in contents[0].split(",")]
address = headers.index("address")

with open("new_csv.csv", "a") as f:
    for e in contents[1:]:
        c = e.split(",")
        f.writelines(f"""{"|".join(c[:address])}|{", ".join(c[address:(len(c) - len(headers)   address   1)])}|{"|".join(c[(len(c) - len(headers)   address   1):])}""")

CodePudding user response:

As long as the address field is the only one with extra commas, this will clean-up and write a CSV file that will read nicely in Excel:

import csv

with (open('input.csv', newline='') as fin,
      open('output.csv', 'w', newline='') as fout):

    r = csv.reader(fin, skipinitialspace=True) # handles non-standard comma-space separator
    w = csv.writer(fout)

    # header doesn't need extra processing
    header = next(r)
    w.writerow(header)

    # *address will capture any extra columns due to commas in address.
    for rownum, _id, first, last, age, *address, weight, hair, pet, food in r:
        # the extra address columns will be joined into one column.
        w.writerow([rownum, _id, first, last, age, ', '.join(address), weight, hair, pet, food])

Output:

outrownum,id,first,last,age,ADDRESS,weight,hair,pet,food
1,123450,John,Bingo,47,"123 Odd St., Waverly Place Apts, PO Box 12345, Apt#5E, Upper-Ontario, Eastern Province A12-E765, Not Puerto Rico, US",299,red,cat,lasagna
2,125379,Joe,Durante,61,"19345 S. 1st Ave., Seattle, WA, 16748",180,blonde,dog,hotdogs
3,197572,James,Gringo,39,"123 Maypole St., Northside Castle, upper east side, NY, NY 30594",202,brown,dog,lo-mein
4,129358,Jim,Dingus,22,"0985 Martyr Ave, Fancytown, MA 49436, USA",163,brown,goldfish,hamburgers
5,987543,Dwayne 'The Rock',Johnson,42,"555 Fitness Ln, Los Angeles, CA, 90210",260,black,dog,steak
6,048573,Jean,Grey,33,"987 X-Men Rd., Rm. 3F, outside boston?, MA 34972",130,red,<null>,salad
7,756432,Jose,Cuervo,59,"444 Jalisco Rd., agave_town, Mexico, not, sure, what, their zipcode system is?",145,black,dog,margaritas
8,845384,Junebug,Messerschmit,2,"22nd Ave N, Boston, MA 45678",130,blonde,turtle,lollipops
9,634839,Jimbo,Humboldt,99,111 1st Street Kansas City KS 84638,220,brown,ferrets,tacos
10,483629,Julius,Caesar,30,Emperors Estate in Ancient Rome,145,brown,servants,grapes
  • Related