I remember being told during my master's that if we know the dtype of the columns of our pandas DataFrame, it is always good practice to declare it explicitly. This makes sense: it surfaces bugs at an early stage and helps find errors in our data. I also believe it makes the code faster.

I did this in my job, and one of my supervisors asked me what the point of doing that is if Python does it for us; I was told it is bad practice.

I have been reading around and I cannot find an answer. I believe this is good practice, but I am asking here to confirm it and find reasons that support it, or, on the contrary, to learn something new.
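For concreteness, this is the kind of declaration I mean (a minimal sketch; the file and column names are made up):

```python
import pandas as pd

# Hypothetical file and columns, just to illustrate declaring dtypes up front.
df = pd.read_csv(
    "sales.csv",
    dtype={"order_id": "int64", "region": "category", "amount": "float64"},
)
```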
CodePudding user response:
It's typically just a matter of preference. For read_csv(), I personally choose to let pandas infer these details: there are already so many parameters, and dtype inference is reliable (otherwise it wouldn't be optional); you're more likely to be wrong about these details than pandas is.
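For typical, well-formed files, inference just works. A quick sketch with a made-up in-memory CSV:

```python
import io

import pandas as pd

# Made-up, well-formed CSV: inference picks sensible dtypes on its own.
csv_text = "id,price,quantity\n1,9.99,3\n2,4.50,7\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)
# id            int64
# price       float64
# quantity      int64
```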
However, being explicit with the dtype can be useful, because the low_memory parameter of read_csv() is True by default: the file is then processed in chunks, which carries the risk of mixed types being inferred across chunks.
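A sketch of that risk, using a made-up column that is numeric for most of the file but has a stray string near the end (the file needs to be large enough to be parsed in chunks):

```python
import io

import pandas as pd

# Made-up data: a long run of integers followed by one string value.
big_csv = "col\n" + "\n".join(["1"] * 1_000_000) + "\nnot-a-number\n"

# With the default low_memory=True, each chunk is typed independently,
# so this may emit "DtypeWarning: Columns (0) have mixed types".
df = pd.read_csv(io.StringIO(big_csv))

# Declaring the dtype up front removes the ambiguity entirely.
df = pd.read_csv(io.StringIO(big_csv), dtype={"col": "string"})
print(df["col"].dtype)  # string
```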
Inferring dtypes is efficient, so the performance gain from passing them in explicitly is negligible.
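If you want to check this on your own data, a rough benchmark sketch (the CSV is made up, and the numbers will vary by machine and file):

```python
import io
import timeit

import pandas as pd

# Made-up numeric CSV, large enough for timing differences to show up.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 0.5}" for i in range(500_000))

t_inferred = timeit.timeit(lambda: pd.read_csv(io.StringIO(csv_text)), number=5)
t_explicit = timeit.timeit(
    lambda: pd.read_csv(io.StringIO(csv_text), dtype={"a": "int64", "b": "float64"}),
    number=5,
)
print(f"inferred: {t_inferred:.2f}s  explicit: {t_explicit:.2f}s")
```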