I am running the example code below:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder
url_data = pd.read_csv('phishing_site_urls.csv')
url_data.drop_duplicates(inplace = True)
print(url_data.shape)
# X = input data (URLs), y = output (whether the URL is bad or good)
X = url_data.drop(columns=['Label'])
y = url_data['Label']
model = DecisionTreeClassifier()
model.fit(X, y)
predictions = model.predict([["Paste suspected Phishy Link here"]])
print(predictions)
**- I am using a CSV named phishing_site_urls.csv that has two columns, "URL" and "Label". The URL column holds links that are either phishing or valid, and the Label column holds the corresponding "bad" or "good" value for each link.
- My question: I keep getting the error "ValueError: could not convert string to float:". I assume there has to be some way of encoding the links from strings to floats so the model can run? If so, I would appreciate some insight on how I can do this.**
CodePudding user response:
I assume you are new to machine learning, so before diving deep into neural nets and NLP (natural language processing) papers, I think getting accustomed to how categorical data can be encoded in different scenarios would be a good first step. You can see the guide here (Section 6.3.4 is the section on encoding categorical data):
https://scikit-learn.org/stable/modules/preprocessing.html
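To give a feel for what that guide covers, here is a minimal sketch of one-hot encoding with scikit-learn's OneHotEncoder. The `toy` DataFrame and its `protocol` column are made up for illustration and are not part of your dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a single categorical column with three distinct values.
toy = pd.DataFrame({"protocol": ["http", "https", "ftp", "https"]})

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; toarray() makes it dense
# so it is easy to inspect.
encoded = encoder.fit_transform(toy[["protocol"]]).toarray()

print(encoder.categories_)  # the categories learned for each column
print(encoded)              # one row per sample, one column per category
```

Each row ends up with a single 1 in the column for its category. Note that this works well when a column has a small set of repeated values. Raw URLs are almost all unique, so one-hot encoding them directly would create roughly one column per URL and would not generalize to new links.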
I also assume that this dataset is just for practice, so it may be better to pick an easier dataset than to jump straight into this one while you are not yet familiar with text preprocessing, word embeddings, etc.
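If you do want to keep going with the URL data, one possible approach (a sketch, not the only way) is to treat each URL as text and turn it into numeric features with character n-grams before fitting the tree. This uses TfidfVectorizer rather than OneHotEncoder. The `urls` and `labels` lists below are hypothetical stand-ins for your "URL" and "Label" columns, and the n-gram range is an arbitrary choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for url_data["URL"] and url_data["Label"].
urls = [
    "example.com/login",
    "example.org/docs/index.html",
    "secure-update.example.net/verify/account?id=123",
    "free-prizes.example.biz/claim.php",
]
labels = ["good", "good", "bad", "bad"]

# The vectorizer converts each string into a numeric vector of character
# n-gram weights; the pipeline feeds those vectors to the classifier.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    DecisionTreeClassifier(random_state=0),
)

# The vectorizer expects a 1-D sequence of strings, not a one-column DataFrame.
model.fit(urls, labels)

# predict also takes a flat list of strings, not a nested list.
predictions = model.predict(["example.com/login"])
print(predictions)
```

With your CSV, that would mean passing `url_data['URL']` (a Series) as X instead of `url_data.drop(columns=['Label'])`. A toy example like this says nothing about real accuracy, so you would still want a train/test split to evaluate the model.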