I just started, so this might be a stupid question, but I have the following problem: I created a .csv file for some basic data description. Although all values are numerical and none are missing, df.dtypes reports most variables as object, with only some being int64 or float64. Do I have to manually convert all object variables to numerical ones in code? Or did I do something wrong when creating my CSV?
Also, the date, which I saved in the format yyyy-mm-dd, is shown as object instead of a date type.
The numbers of the data range from [0,2] for some variables and [0,2000000] for others. Could the formatting in Excel be a problem?
Is there any "How to build your csv" documentation, so that I don't have to ask stupid beginner questions like this?
Additionally, I was told that for a model to work properly I need to do some scaling/normalization of my data, as the value ranges differ a lot. Where can I find more information on that?
CodePudding user response:
I would suggest doing the data type conversion before saving the CSV file. You can also use the functions below for conversion:
'DataFrame.astype()'
'pandas.to_numeric()'
'DataFrame.convert_dtypes()'
You can use the following link for information on scaling: https://www.analyticsvidhya.com/blog/2020/07/types-of-feature-transformation-and-scaling/
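As a minimal sketch of how such a conversion could look: the column names and values below are made up for illustration, and one cell deliberately contains text to show a common reason why a column ends up as object.

```python
import pandas as pd

# Hypothetical data: numbers that were read as text (dtype "object").
df = pd.DataFrame({
    "count": ["1", "2", "0"],
    "amount": ["1500000", "2000000", "n/a"],  # one entry is not a number
    "ratio": ["0.5", "1.25", "2"],
})

# astype() converts a column when every value is a valid number.
df["count"] = df["count"].astype("int64")

# pd.to_numeric() with errors="coerce" turns unparseable entries into NaN
# instead of raising, which helps to locate the problematic cells.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# Without errors="coerce", to_numeric raises if a value cannot be parsed.
df["ratio"] = pd.to_numeric(df["ratio"])

print(df.dtypes)
print(df[df["amount"].isna()])  # rows whose "amount" could not be parsed
```

Note that convert_dtypes() picks the best dtype for values that are already typed; as far as I know it does not parse text into numbers, so for text columns to_numeric() is usually the better fit.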
CodePudding user response:
pd.read_csv already has an option to specify the types: if you want, you can pass the dtype argument to read_csv. For the date, you always have to convert the column to datetime yourself, for example with the parse_dates argument of read_csv or with pd.to_datetime.
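A small sketch of reading a CSV with explicit types, assuming hypothetical column names (date, category, amount); io.StringIO only stands in for a real file path here.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = """date,category,amount
2021-01-31,0,1500000
2021-02-28,2,2000000
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"category": "int64", "amount": "float64"},  # force column types
    parse_dates=["date"],  # parse yyyy-mm-dd strings as datetime
)

print(df.dtypes)
```

If a column contains a value that does not fit the requested dtype, read_csv raises an error, which is often a quick way to find stray text in a column that should be numeric.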
Whether to scale or normalize your data also depends on which machine learning model you are going to use. For example, if you compare a random forest and a KNN, the KNN needs scaled features because it works with distances, while a random forest is largely insensitive to feature scale.
In my personal opinion, "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems" is a good book to start with.
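To make the two common options concrete, here is a minimal sketch using plain pandas arithmetic on made-up columns with very different ranges. scikit-learn's MinMaxScaler and StandardScaler are the usual tools for the same job; they are not used here only to keep the example dependency-free.

```python
import pandas as pd

# Hypothetical columns with very different ranges, as in the question.
df = pd.DataFrame({
    "small": [0.0, 1.0, 2.0],
    "large": [0.0, 500000.0, 2000000.0],
})

# Min-max scaling: map every column to the range [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Standardization: zero mean and unit variance per column.
# ddof=0 uses the population standard deviation.
standardized = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(standardized)
```

In a real model pipeline the minimum, maximum, mean and standard deviation would normally be computed on the training data only and then reused for the test data.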