Loading CSV Data with NumPy Fails with an Error

Time:11-05

I have a CSV that has the first few rows like this:

3Blue1Brown;UCYO_jab_esuFRV4b17AJtAw;Q&A #2   Net Neutrality Nuance;liL66CApESk;2017-12-14 03:59:29 00:00;141644;4661;107;329;1409.0;43.5607476635514;0.002322724577108808;100.52803406671399
3Blue1Brown;UCYO_jab_esuFRV4b17AJtAw;The hardest problem on the hardest test;OkmNXy7er84;2017-12-08 04:52:24 00:00;13109536;346554;5569;19721;1415.0;62.229125516250676;0.0015043247907477427;9264.689752650176

When I try to load this into a NumPy array, I get errors that I do not understand. I guess it might have to do with the special characters, or maybe with the encoding of the CSV data? Here is the code:

from numpy import loadtxt
import numpy as np

datas_path = 'target/youtube_videos.csv'
data = np.genfromtxt(datas_path, delimiter=';', dtype=None, names=True,\
       deletechars="~!@#$%^&*()-= ~\|]}[{';: /?.>,<.", case_sensitive=True)

Here is the error:

Fail to execute line 11:        deletechars="~!@#$%^&*()-= ~\|]}[{';: /?.>,<.", case_sensitive=True)
Traceback (most recent call last):
  File "/tmp/1636087867308-0/zeppelin_python.py", line 158, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 11, in <module>
  File "/home/joesan/.pyenv/versions/3.7.8/lib/python3.7/site-packages/numpy/lib/npyio.py", line 2124, in genfromtxt
    raise ValueError(errmsg)
ValueError: Some errors were detected !
    Line #60 (got 3 columns instead of 13)
    Line #353 (got 3 columns instead of 13)
    Line #720 (got 3 columns instead of 13)
    Line #3008 (got 3 columns instead of 13)
    Line #3077 (got 3 columns instead of 13)
    Line #3129 (got 3 columns instead of 13)
    Line #3154 (got 3 columns instead of 13)
    Line #3163 (got 3 columns instead of 13)
    Line #3175 (got 3 columns instead of 13)
    Line #3290 (got 3 columns instead of 13)
    Line #3300 (got 3 columns instead of 13)
    Line #3310 (got 3 columns instead of 13)
    Line #3316 (got 3 columns instead of 13)
    Line #3321 (got 3 columns instead of 13)
    Line #3328 (got 3 columns instead of 13)
    Line #3334 (got 3 columns instead of 13)
    Line #3340 (got 3 columns instead of 13)
    Line #3361 (got 3 columns instead of 13)
    Line #3366 (got 3 columns instead of 13)
    Line #3367 (got 3 columns instead of 13)
    Line #3375 (got 3 columns instead of 13)
    Line #3385 (got 3 columns instead of 13)
    Line #3397 (got 3 columns instead of 13)
    Line #3407 (got 3 columns instead of 13)
    Line #3433 (got 3 columns instead of 13)
    Line #3444 (got 3 columns instead of 13)
    Line #3450 (got 3 columns instead of 13)
    Line #3452 (got 3 columns instead of 13)
    Line #3482 (got 3 columns instead of 13)
    Line #3511 (got 3 columns instead of 13)
    Line #3522 (got 3 columns instead of 13)
    Line #3531 (got 3 columns instead of 13)
    Line #3536 (got 3 columns instead of 13)

Line #60 of my CSV file is the first record shown above.

EDIT: I managed to fix it like this:

data = np.genfromtxt(datas_path, delimiter=';', dtype=str, comments='%', names=True,\
       deletechars="~!@#$%^&*()-= ~\|]}[{';: /?.>,<.", case_sensitive=True)

But now this line fails:

Adam Savage’s Tested;UCiDJtJKMICpb9B1qf7qjEOA;99% Invisible - The Adam Savage Project - 10/6/20;ClxSdX3ynGQ;2020-10-03 01:45:19 00:00;28334;782;29;87;385.0;26.96551724137931;0.0030705159878591094;73.59480519480519

CodePudding user response:

Unless you have a specific reason to use NumPy, Pandas may be a better option, as it is better suited to parsing text than NumPy. This snippet reads the CSV with Pandas, using a semicolon as the separator:

import pandas as pd
datas_path = 'target/youtube_videos.csv'
df = pd.read_csv(datas_path, sep=';')
print(df)

This correctly parses the lines that NumPy was parsing incorrectly. NumPy is unfortunately a bit fragile when it comes to parsing text. But since Pandas uses NumPy under the hood, Pandas objects (Series and DataFrames) are easy to adapt for use with NumPy routines.
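A likely cause of the "got 3 columns instead of 13" errors is that genfromtxt treats '#' as the start of a comment by default, so the row with "Q&A #2" is cut off inside the third field. Setting comments='%' then moves the same problem to the row with "99% Invisible". The sketch below is only an illustration on in-memory stand-in rows: the three column names are hypothetical (the real file has 13 columns), and it assumes no field contains the ';' delimiter. It shows comments=None for NumPy, and to_numpy() for handing a Pandas column to NumPy.

```python
import io
import numpy as np
import pandas as pd

# Stand-in rows; the column names are hypothetical, the real file has 13 columns.
rows = (
    "channel;title;views\n"
    "3Blue1Brown;Q&A #2   Net Neutrality Nuance;141644\n"
    "Adam Savage's Tested;99% Invisible - The Adam Savage Project - 10/6/20;28334\n"
)

# NumPy: comments=None turns off comment handling, so '#' and '%' stay in the field.
data = np.genfromtxt(io.StringIO(rows), delimiter=';', dtype=None, names=True,
                     comments=None, encoding='utf-8')
print(data['title'])

# Pandas: read_csv does not treat any character as a comment unless asked to.
df = pd.read_csv(io.StringIO(rows), sep=';')
views = df['views'].to_numpy()  # plain NumPy array, usable with NumPy routines
print(views.mean())
```

With the real file, the same comments=None argument can be passed to the original genfromtxt call alongside the existing deletechars and case_sensitive arguments.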
