I have a csv file. The third column can contain odd separators like this:
Name,Last Name,Job,ID
John,Smith,Architect,ID2020
Taylor,Swift,Singer,Songwriter,ID2020-123
I know that the third column can some times be faulty, also next column always starts with ID. The odd comma is between singer and songwriter. How can I replace the odd comma with lets say tilda, so it would be possible to read the file using pandas without errors? The actual file has 30 columns, so regular expressions are probably way to go. Thanks for your time.
CodePudding user response:
IIUC, try:
with open("original.csv") as infile:
rows = infile.read().splitlines()
with open("output.csv", "w") as outfile:
for row in rows:
name, lname, *rest = row.split(",")
job = "~".join(rest[:-1])
ID = rest[-1]
outfile.write(f"{name},{lname},{job},{ID}\n")
df = pd.read_csv("output.csv")
>>> df
Name Last Name Job ID
0 John Smith Architect ID2020
1 Taylor Swift Singer~Songwriter ID2020-123
CodePudding user response:
The other "standard" way to separate columns in CSV files is using the semicolon.
The logic below does some string handling to split and rejoin, using the semicolon for the rejoining...
with open("somefile.csv") as infile:
data = infile.read().splitlines()
with open("someotherfile.csv", "w") as outfile:
for row in data:
splitrow = row.split(",")
if len(splitrow) > 4:
splitrow[2] = f"{splitrow[2]},{splitrow.pop(3)}"
outfile.write(";".join(splitrow) '\n')
import pandas as pd
df = pd.read_csv("someotherfile.csv", sep=';')
print(df)
output
Name Last Name Job ID
0 John Smith Architect ID2020
1 Taylor Swift Singer,Songwriter ID2020-123
CodePudding user response:
Try the following approach:
import pandas as pd
import csv
data = []
with open('input.csv') as f_input:
csv_input = csv.reader(f_input)
header = next(csv_input)
for row in csv_input:
data.append([*row[:2], ' '.join(row[2:-1]), row[-1]])
df = pd.DataFrame(data, columns=header)
print(df)
For your example, this gives:
Name Last Name Job ID
0 John Smith Architect ID2020
1 Taylor Swift Singer Songwriter ID2020-123
This assumes the unwanted commas are only in the Job column. It takes the Name
and Last Names
fields, then combines all fields until the last ID
field. So in effect the Job
field can have any number of commas.
This will need to be tweaked depending where all the other columns go.
CodePudding user response:
IIUC you can do it like this, going through line by line in the textfile. "also every column starts with ID" do you want it to be just the number ? I removed the ID in every row in my solution.
import pandas as pd
from collections import defaultdict
d = defaultdict(list)
with open("input_list.txt") as f:
next(f)
for line in f:
name, lname, *job, ID = line.strip().split(",")
d["Name"].append(name)
d["Last Name"].append(lname)
d["Job"].append(" ".join(job))
d["ID"].append(ID[2:])
df = pd.DataFrame(d)
print(df)
Name Last Name Job ID
0 John Smith Architect 2020
1 Taylor Swift Singer Songwriter 2020-123