I have a csv file with errors. Some rows have 4 separators (;), while most part of the rows have 5 separators (;). The reason why I have sometimes 4 separators is because sometimes the 2nd column is missing.
Example:
var1;var2;var3;var4;var5
1;10;john;40;56
2;mary;34;78
3;90.0;smith;52;45
If I import the csv file:
import pandas as pd
df=pd.read_csv('myfile.csv', sep=";")
I obtain the dataframe:
var1 var2 var3 var4 var5
1 10 john 40 56
2 mary 34 78
3 90.0 smith 52 45
and I want:
var1 var2 var3 var4 var5
1 10 john 40 56
2 NaN mary 34 78
3 90.0 smith 52 45
Maybe I could split the dataframe into two parts (one with 5 ";", another with 4 ";") and append in the end. I started with the code:
import csv
with open('myfile.csv',newline='') as fin, open('output.csv','w',newline='') as fout:
reader = csv.reader(fin)
writer = csv.writer(fout)
for row in reader:
if len(row) > 4:
writer.writerow(row)
but I obtain the error: "line contains NUL".
CodePudding user response:
A workaround could be to shift
part of the DataFrame:
# read with shifted columns
df = pd.read_csv('myfile.csv', sep=';')
# identify rows with incorrect data
m = df['var5'].isna()
# define columns to correct (second to last)
cols = df.columns[1:]
# correct the invalid rows
df.loc[m, cols] = df.loc[m, cols].shift(axis=1)
output:
var1 var2 var3 var4 var5
0 1 10 john 40 56.0
1 2 None mary 34 78.0
2 3 90.0 smith 52 45.0
CodePudding user response:
You might preprocess file.csv
inserting ;
if needed following way, let file.csv
content be
var1;var2;var3;var4;var5
1;10;john;40;56
2;mary;34;78
3;90.0;smith;52;45
then
with open("file.csv","r") as fin, open("file_fixed.csv","w") as fout:
for line in fin:
if line.count(';') == 3:
line = line.replace(';',';;',1)
fout.write(line)
does create file_fixed.csv with following content
var1;var2;var3;var4;var5
1;10;john;40;56
2;;mary;34;78
3;90.0;smith;52;45
Explanation: if line has 3 ;
then replace ;
using ;;
once (i.e. replace only first ;
using ;;
)
CodePudding user response:
Another possible solution, based on numpy.roll
and pandas.DataFrame.apply
:
df = pd.read_csv('myfile.csv', sep=';')
df.iloc[:, 1:] = df.iloc[:, 1:].apply(
lambda x: np.roll(x, 1) if np.isnan(x[3]) else x, axis=1)
Output:
var1 var2 var3 var4 var5
0 1 10 john 40 56.0
1 2 NaN mary 34 78.0
2 3 90.0 smith 52 45.0
CodePudding user response:
about the particular bug you have in the end, you'll see by opening your 'output.csv' that it does not contains anything, that's why the error is thrown when you try to open it (by using pd.read_csv i suppose ?).
If you add a print
line in your reading loop, you'll notice that each line is a 1-element list of a string with your values seperated by semicolons instead of a 5-elements list of values.
It's because you have to tell your reader method that you're using a semicolon as delimiter (instead of ,
which is the default one).
So if you specify the delimiter ;
for your reader and writer, it will create a new csv file with only the lines without missing values :
import csv
with open('myfile.csv',newline='') as fin, open('output.csv','w',newline='') as fout:
reader = csv.reader(fin, delimiter = ';')
writer = csv.writer(fout, delimiter = ';')
for row in reader:
print(row)
if len(row) > 4:
writer.writerow(row)
Hope this helps