Edit: using utf-16 seems to get me closer to working, but I have CSV values that include commas, such as "one example value is a description, which is long and can include commas, and quotes".
So with my current code:
filepath="csv_input/frups.csv"
rows = []
with open(filepath, encoding='utf-16') as f:
    for line in f:
        print('line=', line)
        formatted_line = line.strip().split(",")
        print('formatted_line=', formatted_line)
        rows.append(formatted_line)
        print('')
Lines get formatted incorrectly:
line= "FRUPS" "11111112" "Paahou 11111112, 11111112,11111112" "Bar, Achal" "Iagress" "Unassigned" "Normal" "GaWu , Suaair center will not be able to repair 3 couch part 11111112, 11111112,11111112 . Pleasa to repair .
formatted_line= ['"FRUPS"\t"11111112"\t"Parts not able to repair in Suzhou 11111112', ' 11111112', '11111112"\t"Baaaaaar', ' Acaaaal"\t"In Progress"\t"Unassigned"\t"Normal"\t"Got coaow Wu ', ' Suar cat 11111112', ' 11111112', '11111112. Pleasa to repair .']
line= 11111112
formatted_line= ['11111112']
So in this example, the printed line appears to be separated by long spaces, but splitting on commas does not reliably break it into the correct fields.
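A quick way to see what those long spaces actually are (a diagnostic sketch, not one of my original attempts) is to print the repr() of a raw line, which shows control characters such as tabs explicitly:

# Diagnostic sketch: repr() makes whitespace visible, so a tab prints as \t
# instead of blank space.
with open(filepath, encoding='utf-16') as f:
    first_line = next(f)
print(repr(first_line))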
I am trying to read a CSV line by line in Python, but each solution leads to a different error.
- Using pandas:
import pandas as pd

filepath="csv_input/frups.csv"
data = pd.read_csv(filepath, encoding='utf-16')
for thing in data:
    print(thing)
    print('')
read_csv fails on the file with the error: Error tokenizing data. C error: Expected 7 fields in line 16, saw 8
- Using csv.reader:
from csv import reader

# open file in read mode
with open(filepath, 'r') as read_obj:
    # pass the file object to reader() to get the reader object
    csv_reader = reader(read_obj)
    # Iterate over each row in the csv using reader object
    for row in csv_reader:
        # row variable is a list that represents a row in csv
        print(row)
Fails at the for row in csv_reader line with the error: line contains NUL
I've tried to figure out what these NUL characters are, but trying to investigate using code leads to different errors:
data = open(filepath, 'rb').read()
print(data.find('\x00'))
error: argument should be integer or bytes-like object, not 'str'
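(Looking at that error again, it seems to come from searching bytes with a str needle; a minimal sketch of the corrected check:)

data = open(filepath, 'rb').read()
print(data.find(b'\x00'))   # pass bytes, not str, when searching in bytes
print(data.count(b'\x00'))  # how many NUL bytes there are in total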
- Another read solution, trying to strip certain characters:
with open(filepath,'rb') as f:
    contents = f.read()
    contents = contents.rstrip("\n").decode("utf-16")
    contents = contents.split("\r\n")
error: TypeError: a bytes-like object is required, not 'str'
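(Similarly, this TypeError appears to come from calling rstrip with a str argument on bytes; a minimal sketch that decodes first and then works on the resulting str:)

with open(filepath, 'rb') as f:
    contents = f.read()
text = contents.decode('utf-16')  # decode the bytes first
lines = text.splitlines()         # then split into lines as str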
It seems like my CSV has some weird characters that cause Python to error out. I can open and view the CSV just fine in Excel, so how can I read it line by line? Such as:
row[0]=['col1','col2','col3']
row[1]=['val1','val2','val3']
etc...
CodePudding user response:
What you have shown for line and formatted_line is a hint that:
- your file is utf-16 encoded
- it uses tabs (\t) as delimiters
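A quick byte-level check can confirm both (my own sketch, not required for the fix): a UTF-16 little-endian file normally starts with the byte-order mark b'\xff\xfe', and the NUL bytes the csv module complained about are just the high bytes of UTF-16 code units.

with open(filepath, 'rb') as f:
    head = f.read(2)
print(head == b'\xff\xfe')  # True for a UTF-16 little-endian BOM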
So you should use:
with the csv module:
# open file in read mode
with open(filepath, 'r', encoding='utf-16') as read_obj:
    # pass the file object to reader() to get the reader object
    csv_reader = reader(read_obj, delimiter='\t')
    # Iterate over each row in the csv using reader object
    for row in csv_reader:
        # row variable is a list that represents a row in csv
        print(row)
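If you want the rows structure from the question, the same reader can fill it directly (a small variation on the code above):

from csv import reader

rows = []
with open(filepath, 'r', encoding='utf-16') as read_obj:
    csv_reader = reader(read_obj, delimiter='\t')
    for row in csv_reader:
        rows.append(row)  # each row is already a list of field values

# rows[0] holds the header, rows[1] the first data line, and so on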
with Pandas:
data = pd.read_csv(filepath, encoding='utf-16', sep='\t')
for thing in data:
    print(thing)
    print('')
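One note on the Pandas version (my addition, not part of the answer above): iterating a DataFrame directly yields the column names; to walk the data line by line you can iterate the rows instead, for example:

import pandas as pd

data = pd.read_csv(filepath, encoding='utf-16', sep='\t')
for row in data.itertuples(index=False):
    print(row)  # one namedtuple per CSV line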
CodePudding user response:
You can always read the file manually to build such a structure:
rows = []
with open(filepath) as f:
    for line in f:
        rows.append(line.strip().split(","))
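For the file in the question this only works if you also pass the right encoding and delimiter (my adaptation of the snippet above, combining it with the first answer's findings; a plain split still will not handle quoted fields the way the csv module does):

rows = []
with open(filepath, encoding='utf-16') as f:
    for line in f:
        rows.append(line.strip().split('\t'))  # split on tabs, not commas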