I have a CSV file whose first few rows look like this:
3Blue1Brown;UCYO_jab_esuFRV4b17AJtAw;Q&A #2 Net Neutrality Nuance;liL66CApESk;2017-12-14 03:59:29 00:00;141644;4661;107;329;1409.0;43.5607476635514;0.002322724577108808;100.52803406671399
3Blue1Brown;UCYO_jab_esuFRV4b17AJtAw;The hardest problem on the hardest test;OkmNXy7er84;2017-12-08 04:52:24 00:00;13109536;346554;5569;19721;1415.0;62.229125516250676;0.0015043247907477427;9264.689752650176
When I try to load this into a NumPy array, I get errors that I do not understand. I guess it might have to do with the special characters, or maybe with the encoding of the CSV data? Here is the code:
from numpy import loadtxt
import numpy as np
datas_path = 'target/youtube_videos.csv'
data = np.genfromtxt(datas_path, delimiter=';', dtype=None, names=True,\
deletechars="~!@#$%^&*()-= ~\|]}[{';: /?.>,<.", case_sensitive=True)
Here is the error:
Fail to execute line 11: deletechars="~!@#$%^&*()-= ~\|]}[{';: /?.>,<.", case_sensitive=True)
Traceback (most recent call last):
File "/tmp/1636087867308-0/zeppelin_python.py", line 158, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 11, in <module>
File "/home/joesan/.pyenv/versions/3.7.8/lib/python3.7/site-packages/numpy/lib/npyio.py", line 2124, in genfromtxt
raise ValueError(errmsg)
ValueError: Some errors were detected !
Line #60 (got 3 columns instead of 13)
Line #353 (got 3 columns instead of 13)
Line #720 (got 3 columns instead of 13)
Line #3008 (got 3 columns instead of 13)
Line #3077 (got 3 columns instead of 13)
Line #3129 (got 3 columns instead of 13)
Line #3154 (got 3 columns instead of 13)
Line #3163 (got 3 columns instead of 13)
Line #3175 (got 3 columns instead of 13)
Line #3290 (got 3 columns instead of 13)
Line #3300 (got 3 columns instead of 13)
Line #3310 (got 3 columns instead of 13)
Line #3316 (got 3 columns instead of 13)
Line #3321 (got 3 columns instead of 13)
Line #3328 (got 3 columns instead of 13)
Line #3334 (got 3 columns instead of 13)
Line #3340 (got 3 columns instead of 13)
Line #3361 (got 3 columns instead of 13)
Line #3366 (got 3 columns instead of 13)
Line #3367 (got 3 columns instead of 13)
Line #3375 (got 3 columns instead of 13)
Line #3385 (got 3 columns instead of 13)
Line #3397 (got 3 columns instead of 13)
Line #3407 (got 3 columns instead of 13)
Line #3433 (got 3 columns instead of 13)
Line #3444 (got 3 columns instead of 13)
Line #3450 (got 3 columns instead of 13)
Line #3452 (got 3 columns instead of 13)
Line #3482 (got 3 columns instead of 13)
Line #3511 (got 3 columns instead of 13)
Line #3522 (got 3 columns instead of 13)
Line #3531 (got 3 columns instead of 13)
Line #3536 (got 3 columns instead of 13)
Line #60 is the first of the two records shown above.
EDIT: I managed to fix it like this:
data = np.genfromtxt(datas_path, delimiter=';', dtype=str, comments='%', names=True,\
deletechars="~!@#$%^&*()-= ~\|]}[{';: /?.>,<.", case_sensitive=True)
But now this line fails:
Adam Savage’s Tested;UCiDJtJKMICpb9B1qf7qjEOA;99% Invisible - The Adam Savage Project - 10/6/20;ClxSdX3ynGQ;2020-10-03 01:45:19 00:00;28334;782;29;87;385.0;26.96551724137931;0.0030705159878591094;73.59480519480519
CodePudding user response:
Unless you have a specific reason to use NumPy, pandas may be a better option, as it is better suited to parsing files with free-form string fields. This snippet reads the CSV with pandas, using a semicolon as the separator:
import pandas as pd
datas_path = 'target/youtube_videos.csv'
df = pd.read_csv(datas_path, sep=';')
print(df)
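This correctly parses the lines that NumPy rejected. Those rows fail in genfromtxt because it treats '#' as the start of a comment by default, so a title such as "Q&A #2 Net Neutrality Nuance" is cut off after the third column; switching to comments='%' only moves the problem to titles that contain '%', such as "99% Invisible ...". NumPy is unfortunately a bit fragile when it comes to parsing text. But, as pandas uses NumPy under the hood, pandas objects (Series and DataFrames) are easy to use with NumPy routines, for example via df.to_numpy().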
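If you do need to stay with NumPy, one option is to turn off comment handling altogether by passing comments=None, so that '#' and '%' in titles are kept as data. The following is only a minimal sketch: the two rows are abridged from the question to four columns, and the header names (channel, video_id, title, views) are made up for illustration, since the real header is not shown.

```python
import io

import numpy as np

# Hypothetical header plus two rows abridged from the question.
# The real file has 13 columns and its own column names.
sample = (
    "channel;video_id;title;views\n"
    "3Blue1Brown;liL66CApESk;Q&A #2 Net Neutrality Nuance;141644\n"
    "Adam Savage’s Tested;ClxSdX3ynGQ;"
    "99% Invisible - The Adam Savage Project - 10/6/20;28334\n"
)

# With the default comments='#', everything after '#' in the first title
# is discarded, so that row ends up with too few columns.
default_error = None
try:
    np.genfromtxt(io.StringIO(sample), delimiter=";", dtype=None,
                  names=True, encoding="utf-8")
except ValueError as err:
    default_error = str(err)

# comments=None disables comment detection, so no character is special.
data = np.genfromtxt(
    io.StringIO(sample),
    delimiter=";",
    dtype=None,        # infer a type per column
    names=True,        # take field names from the first line
    comments=None,     # treat '#' and '%' as ordinary characters
    encoding="utf-8",
)

print(default_error)
print(data["title"])
print(data["views"])
```

For the real file you would pass the path instead of the StringIO object. Whether this is enough depends on the rest of the data; fields that contain the delimiter itself would still need a proper CSV parser such as pandas.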
Reference:
- Original comment.
- Pandas documentation on read_csv.