How to replace odd separators from a csv file to something else using python regex?-CodePudding

I have a csv file. The third column can contain odd separators like this:

Name,Last Name,Job,ID
John,Smith,Architect,ID2020
Taylor,Swift,Singer,Songwriter,ID2020-123

I know that the third column can some times be faulty, also next column always starts with ID. The odd comma is between singer and songwriter. How can I replace the odd comma with lets say tilda, so it would be possible to read the file using pandas without errors? The actual file has 30 columns, so regular expressions are probably way to go. Thanks for your time.

CodePudding user response：

IIUC, try:

with open("original.csv") as infile:
    rows = infile.read().splitlines()

with open("output.csv", "w") as outfile:
    for row in rows:
        name, lname, *rest = row.split(",")
        job = "~".join(rest[:-1])
        ID = rest[-1]
        outfile.write(f"{name},{lname},{job},{ID}\n")

df = pd.read_csv("output.csv")
>>> df
     Name Last Name                Job          ID
0    John     Smith          Architect      ID2020
1  Taylor     Swift  Singer~Songwriter  ID2020-123

CodePudding user response：

The other "standard" way to separate columns in CSV files is using the semicolon.

The logic below does some string handling to split and rejoin, using the semicolon for the rejoining...

with open("somefile.csv") as infile:
    data = infile.read().splitlines()


with open("someotherfile.csv", "w") as outfile:
    for row in data:
        splitrow = row.split(",")
        if len(splitrow) > 4:
            splitrow[2] = f"{splitrow[2]},{splitrow.pop(3)}"
        outfile.write(";".join(splitrow) '\n')

import pandas as pd
df = pd.read_csv("someotherfile.csv", sep=';')
print(df)

output

     Name Last Name                Job          ID
0    John     Smith          Architect      ID2020
1  Taylor     Swift  Singer,Songwriter  ID2020-123

CodePudding user response：

Try the following approach:

import pandas as pd
import csv

data = []

with open('input.csv') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)
    
    for row in csv_input:
        data.append([*row[:2], ' '.join(row[2:-1]), row[-1]])

df = pd.DataFrame(data, columns=header)
print(df)

For your example, this gives:

     Name Last Name                    Job          ID
0    John     Smith              Architect      ID2020
1  Taylor     Swift      Singer Songwriter  ID2020-123

This assumes the unwanted commas are only in the Job column. It takes the Name and Last Names fields, then combines all fields until the last ID field. So in effect the Job field can have any number of commas.

This will need to be tweaked depending where all the other columns go.

CodePudding user response：

IIUC you can do it like this, going through line by line in the textfile. "also every column starts with ID" do you want it to be just the number ? I removed the ID in every row in my solution.

import pandas as pd
from collections import defaultdict

d = defaultdict(list)

with open("input_list.txt") as f:
    next(f)
    for line in f:
        name, lname, *job, ID = line.strip().split(",")
        d["Name"].append(name)
        d["Last Name"].append(lname)
        d["Job"].append(" ".join(job))
        d["ID"].append(ID[2:])

df = pd.DataFrame(d)

print(df)

    Name    Last Name   Job                ID
0   John    Smith       Architect          2020
1   Taylor  Swift       Singer Songwriter  2020-123