Python csv with less separators than necessary-CodePudding

I have a csv file with errors. Some rows have 4 separators (;), while most part of the rows have 5 separators (;). The reason why I have sometimes 4 separators is because sometimes the 2nd column is missing.

Example:

var1;var2;var3;var4;var5
1;10;john;40;56
2;mary;34;78
3;90.0;smith;52;45

If I import the csv file:

import pandas as pd
df=pd.read_csv('myfile.csv', sep=";")

I obtain the dataframe:

var1 var2  var3  var4 var5
1    10    john  40   56
2    mary  34    78
3    90.0  smith 52   45

and I want:

var1 var2  var3  var4 var5
1    10    john  40   56
2    NaN   mary  34    78
3    90.0  smith 52   45

Maybe I could split the dataframe into two parts (one with 5 ";", another with 4 ";") and append in the end. I started with the code:

import csv
with open('myfile.csv',newline='') as fin, open('output.csv','w',newline='') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row in reader:
        if len(row) > 4:
            writer.writerow(row)

but I obtain the error: "line contains NUL".

CodePudding user response：

A workaround could be to shift part of the DataFrame:

# read with shifted columns
df = pd.read_csv('myfile.csv', sep=';')

# identify rows with incorrect data
m = df['var5'].isna()
# define columns to correct (second to last)
cols = df.columns[1:]

# correct the invalid rows
df.loc[m, cols] = df.loc[m, cols].shift(axis=1)

output:

   var1  var2   var3 var4  var5
0     1    10   john   40  56.0
1     2  None   mary   34  78.0
2     3  90.0  smith   52  45.0

CodePudding user response：

You might preprocess file.csv inserting ; if needed following way, let file.csv content be

var1;var2;var3;var4;var5
1;10;john;40;56
2;mary;34;78
3;90.0;smith;52;45

then

with open("file.csv","r") as fin, open("file_fixed.csv","w") as fout:
    for line in fin:
        if line.count(';') == 3:
            line = line.replace(';',';;',1)
        fout.write(line)

does create file_fixed.csv with following content

var1;var2;var3;var4;var5
1;10;john;40;56
2;;mary;34;78
3;90.0;smith;52;45

Explanation: if line has 3 ; then replace ; using ;; once (i.e. replace only first ; using ;;)

CodePudding user response：

Another possible solution, based on numpy.roll and pandas.DataFrame.apply:

df = pd.read_csv('myfile.csv', sep=';')

df.iloc[:, 1:] = df.iloc[:, 1:].apply(
    lambda x: np.roll(x, 1) if np.isnan(x[3]) else x, axis=1)

Output:

   var1  var2   var3 var4  var5
0     1    10   john   40  56.0
1     2   NaN   mary   34  78.0
2     3  90.0  smith   52  45.0

CodePudding user response：

about the particular bug you have in the end, you'll see by opening your 'output.csv' that it does not contains anything, that's why the error is thrown when you try to open it (by using pd.read_csv i suppose ?).

If you add a print line in your reading loop, you'll notice that each line is a 1-element list of a string with your values seperated by semicolons instead of a 5-elements list of values.

It's because you have to tell your reader method that you're using a semicolon as delimiter (instead of , which is the default one). So if you specify the delimiter ; for your reader and writer, it will create a new csv file with only the lines without missing values :

import csv
with open('myfile.csv',newline='') as fin, open('output.csv','w',newline='') as fout:
    reader = csv.reader(fin, delimiter = ';')
    writer = csv.writer(fout, delimiter = ';')
    for row in reader:
        print(row)
        if len(row) > 4:
            writer.writerow(row)

Hope this helps