Why Pandas read_csv is not reading all the data?

Time:12-19

I would like to know whether the problem I have with a particular CSV file is a general pandas error or something related to the file itself. I used pandas read_csv to get the information ... but unfortunately this function does not load all the values. I noticed the error because I was sure the file contained data for particular dates (2017/04/01 - 2017/04/02), so I checked the file in Excel and, as I thought, the values are there. I saved the file as .xlsx and read it again with pandas, this time using read_excel, and the data loaded successfully. The strangest part is that the problem affects only some dates, without any visible pattern: read_csv loads some of the information, but not all of it.

It is the same file. Initially, when processing the data, the file was saved as .csv. Later, I created an .xlsx from that .csv in Excel.

csv file: https://drive.google.com/file/d/1VCte8jCu8dB-Qp4KHClZb5cEAUTzZ5lC/view?usp=sharing

excel file: https://docs.google.com/spreadsheets/d/1p5zJuDhS7PvLwSJMtexRrUHOvC6qexMs/edit?usp=sharing&ouid=112818913372395231829&rtpof=true&sd=true

excel case:

import pandas as pnd

resume = pnd.read_excel("/content/gdrive/hcln/h_RiA_0.50_full_time.xlsx", sheet_name="h_RiA_0.50_full_time (1)", parse_dates=[0])
resume = resume.set_index(["Fecha"])
resume.loc["2017/04/01 23"]
                     h50
Fecha   
2017-04-01 23:00:00 309.0
2017-04-01 23:05:00 287.0
2017-04-01 23:10:00 315.0
2017-04-01 23:15:00 324.0
2017-04-01 23:20:00 325.0
2017-04-01 23:25:00 340.0
2017-04-01 23:30:00 323.0
2017-04-01 23:35:00 330.0
2017-04-01 23:40:00 332.0
2017-04-01 23:45:00 308.0
2017-04-01 23:50:00 319.0
2017-04-01 23:55:00 289.0

csv case:

resume = pnd.read_csv("/content/gdrive/MyDrive/hcln/h_RiA_0.50_full_time.csv", parse_dates = [0])     
resume = resume.set_index(["Fecha"])
resume.loc["2017/04/01 23"]
                    h50
Fecha   
2017-04-01 23:00:00 NaN
2017-04-01 23:05:00 NaN
2017-04-01 23:10:00 NaN
2017-04-01 23:15:00 NaN
2017-04-01 23:20:00 NaN
2017-04-01 23:25:00 NaN
2017-04-01 23:30:00 NaN
2017-04-01 23:35:00 NaN
2017-04-01 23:40:00 NaN
2017-04-01 23:45:00 NaN
2017-04-01 23:50:00 NaN
2017-04-01 23:55:00 NaN

If anyone can spot what the error is, I would appreciate an answer. The screenshots below show what I got in Google Colab.

[Screenshots: csv view, excel view]

CodePudding user response:

I found the answer. Some time ago I changed the column name in the CSV to "h50" using Excel. Excel showed no warning message at the time, so I assumed the change would not affect the stored values. Apparently that was the cause: I re-ran the process that generates h_RiA_0.50_full_time.csv, and with the regenerated file read_csv loads all the values.

I suppose that editing the column name in Excel and saving the file altered it in some way that causes problems when loading the values.
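
To narrow down this kind of problem, one option is to compare the raw text in the file with what pandas can parse as numbers. The following is only a minimal sketch, not the method used above: the actual cause in the original file is not known, the sample data is invented for illustration (only the column names Fecha and h50 come from the question), and the helper name find_unparsed_values is hypothetical.

```python
import io

import pandas as pd


def find_unparsed_values(csv_source, value_col):
    """Return rows whose raw text in value_col is non-empty but not numeric.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Read every column as text so nothing is silently converted to NaN.
    raw = pd.read_csv(csv_source, dtype=str, keep_default_na=False)
    # Values that cannot be parsed as numbers become NaN here.
    numeric = pd.to_numeric(raw[value_col], errors="coerce")
    # Ignore cells that are genuinely empty in the file.
    has_text = raw[value_col].str.strip() != ""
    return raw[numeric.isna() & has_text]


# Invented sample: one value uses a decimal comma, one cell is empty.
sample = io.StringIO(
    "Fecha,h50\n"
    "2017-04-01 23:00:00,309.0\n"
    '2017-04-01 23:05:00,"287,0"\n'
    "2017-04-01 23:10:00,\n"
)
suspect = find_unparsed_values(sample, "h50")
print(suspect)
```

If the returned frame is empty while read_csv still shows NaN for those timestamps, the cells are probably empty in the file itself, and regenerating the CSV (as done above) is the more likely fix.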

CodePudding user response:

This may be completely off the mark.

But usually, when I have mounted the drive from Google Colab, I have used:

/content/drive/MyDrive/Maestría/Tesis/Read_f/hcln/h_RiA_0.50_full_time.xlsx

instead of:

/content/gdrive/MyDrive/Maestría/Tesis/Read_f/hcln/h_RiA_0.50_full_time.xlsx

Another possibility is that the file being read must be the .csv, not the .xlsx.
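
Both points (the path and the file format) can be checked before parsing. The following is a rough sketch under stated assumptions: load_table is a hypothetical helper, not a pandas function, and the demo file is created in a temporary directory rather than on a mounted drive. Reading .xlsx files with read_excel additionally requires an Excel engine such as openpyxl to be installed.

```python
import os
import tempfile

import pandas as pd


def load_table(path, **kwargs):
    """Check that the path exists, then pick the reader from the extension.

    Hypothetical helper for illustration; not part of pandas.
    """
    if not os.path.exists(path):
        # A wrong mount point (e.g. gdrive vs. drive) ends up here.
        raise FileNotFoundError(f"No file at {path!r}; check the mount point")
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        return pd.read_csv(path, **kwargs)
    if ext in (".xlsx", ".xls"):
        return pd.read_excel(path, **kwargs)
    raise ValueError(f"Unsupported extension: {ext!r}")


# Demo with an invented one-row CSV written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w") as f:
        f.write("Fecha,h50\n2017-04-01 23:00:00,309.0\n")
    df = load_table(demo_path, parse_dates=[0])

print(df)
```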
