Converting different datatypes to numerics using Python in CSV file

I have a CSV file and, using scikit-learn in a Jupyter Notebook, I'm trying to apply different machine learning algorithms to it. To do that, I need to convert all my columns to numeric data so that I can predict the app rating, which is a float. The columns that need to be converted to numbers are listed below:

Category: includes 27 different app categories, such as education, medical, musical, etc.

Number of downloads for Android apps: 1000000 to 5000000, 500000 to 1000000, ...

Data updated: February 8 2017, October 26 2016, ...

CodePudding user response：

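First of all, some ML algorithms, such as tree-based ones, can handle categorical data directly, so you don't need to encode it. Whether this works depends on the library: most scikit-learn estimators, including its decision trees and random forests, still expect numeric input.

Second, there are a bunch of techniques for converting categorical data to a numerical format.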

Short answer:

A one-hot encoder would be one of the best options. However, you should take the cardinality of the feature into account (the number of distinct values the feature can take; for example, it is 27 in the Category column), because one-hot encoding creates one new column per value.
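
To make the column growth concrete, here is a minimal sketch of one-hot encoding with pandas `get_dummies`. The sample rows are made up for illustration; only the column name "Category" comes from the question.

```python
import pandas as pd

# Hypothetical sample standing in for the real CSV; only the column name
# "Category" is taken from the question, the values are invented.
df = pd.DataFrame({"Category": ["EDUCATION", "MEDICAL", "MUSIC", "MEDICAL"]})

# One new 0/1 column is created per distinct category value.
encoded = pd.get_dummies(df, columns=["Category"], dtype=int)

print(encoded.columns.tolist())
print(encoded.shape)
```

With three distinct values this yields three columns; with the 27 categories in the real data it would yield 27.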

You can use ordinal encoding, or check this link, to encode "Category". "Number of downloads" is given as a numeric range (for example "1000000 to 5000000"), so first reduce each range to a single number, such as its lower bound, and then scale it. Check the following link for scaling, which you can do with sklearn.
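
A rough sketch of both steps follows, assuming the download ranges are stored as strings like "1000000 to 5000000". The column names, the choice of the lower bound, and the log transform before scaling are my own assumptions, not something stated in the question.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical sample; column names and values are illustrative only.
df = pd.DataFrame({
    "Category": ["EDUCATION", "MEDICAL", "MUSIC"],
    "Downloads": ["1000000 to 5000000", "500000 to 1000000", "1000 to 5000"],
})

# Ordinal encoding: each category becomes a numeric code
# (codes follow the sorted order of the category names).
df["Category_code"] = OrdinalEncoder().fit_transform(df[["Category"]]).ravel()

# Assumption: use the lower bound of each "low to high" range as the value.
df["Downloads_low"] = df["Downloads"].str.split(" to ").str[0].astype(int)

# Download counts span orders of magnitude, so take log10 before
# standardizing to zero mean and unit variance (an optional choice).
log_downloads = np.log10(df[["Downloads_low"]])
df["Downloads_scaled"] = StandardScaler().fit_transform(log_downloads).ravel()

print(df)
```

Keep in mind that ordinal codes imply an order between categories that does not really exist, which matters more for linear models than for tree-based ones.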

"Data updated" is date, so you need first to convert it into datetime using Pandas function to_datetime. Then, you can use this Pandas API to select how you want to convert the date.

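As a small illustration of the date step, the sketch below parses the two sample dates from the question with `to_datetime` and derives a few numeric features through the `.dt` accessor. Which features are useful is an assumption on my part; year, month, and days since the most recent update are just common choices.

```python
import pandas as pd

# The two sample values come from the question; the column name is assumed.
df = pd.DataFrame({"Data updated": ["February 8 2017", "October 26 2016"]})

# Parse the text into datetime64 values; the format matches "February 8 2017".
updated = pd.to_datetime(df["Data updated"], format="%B %d %Y")

# Derive numeric features from the parsed dates.
df["updated_year"] = updated.dt.year
df["updated_month"] = updated.dt.month
# Days between each update and the most recent update in the data.
df["days_since_update"] = (updated.max() - updated).dt.days

print(df)
```
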
CodePudding user response：

Let me break down what needs to be done in the simplest terms.

Load the CSV file into a pandas DataFrame (refer to the documentation for detailed steps):
```
import pandas as pd
df = pd.read_csv('filename.csv')
```
For datetime columns, all you need to do is specify which columns hold dates using the parse_dates parameter of read_csv:
```
date_cols = ["Data updated"]  # hypothetical name; list your file's date column(s) here
df = pd.read_csv("filename.csv", parse_dates=date_cols)
```
Choose the right machine learning algorithm based on your data and your problem statement (classification, regression, time series, or unsupervised learning).
Some machine learning algorithms need the information as categorical, while most require numerical data. Once you have understood what you need to achieve, you need to use encoding techniques.

Encoding is a technique for converting categorical variables into numerical values so that they can easily be fitted to a machine learning model.
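
To show how encoding fits into model training, here is a hedged sketch that wraps an `OrdinalEncoder` and a regressor in a scikit-learn `Pipeline`. The data, the column names, and the choice of `RandomForestRegressor` are all assumptions for illustration; regression is used because the rating to predict is a float.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data; "Rating" stands for the float target.
df = pd.DataFrame({
    "Category": ["EDUCATION", "MEDICAL", "MUSIC", "MEDICAL", "EDUCATION", "MUSIC"],
    "Downloads_low": [1000000, 500000, 1000, 5000, 100000, 50000],
    "Rating": [4.5, 4.1, 3.8, 4.0, 4.4, 3.9],
})
X = df[["Category", "Downloads_low"]]
y = df["Rating"]

# Encode the categorical column; pass the numeric column through unchanged.
# Categories unseen during fit are mapped to -1 instead of raising an error.
preprocess = ColumnTransformer(
    transformers=[
        ("category",
         OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         ["Category"]),
    ],
    remainder="passthrough",
)

# Chain preprocessing and the model so both are fitted together.
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=0)),
])
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Keeping the encoder inside the pipeline means the same transformation is applied at prediction time, so new rows can be passed in their raw form.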

Refer to the categorical encoder documentation and this example link for information on different encoding techniques.

Note: Since you have 27 categories, one-hot encoding may not be ideal, because it adds 27 mostly-zero columns and brings the problems associated with high dimensionality (sparsity and multicollinearity).