As mentioned above, I'm trying to create a model for detecting spam emails based on word occurrences. The information in my dataset is as follows:
- about 2800 variables, one per word, each giving the frequency of that word's occurrences
- a binary spam variable: 1 for spam, 0 for legitimate
I've been using online resources, but I can only find logistic regression and NN tutorials for much smaller datasets, which seem much simpler in comparison. So far I've totaled the word counts for spam and non-spam emails to analyze them, but I'm having trouble creating the model itself.
Does anyone have any sources or insight on how to manage this with a much larger dataset?
Apologies if this is a simple question; I appreciate any advice.
CodePudding user response:
A classical approach uses a generalised linear model (GLM) with a penalty on the number of variables; in this case the GLM is a logistic regression model. The classic penalty techniques are the LASSO, ridge regression, and the elastic net. If the ratio of the number of variables (p) to the number of samples (N) is too high, the shrinkage of your parameter values may be so strong that no parameters are selected as predictive; tuning parameters control the amount of shrinkage. Overall it's a well-studied topic. You haven't mentioned which programming language you will use, but you'll find helpful packages in Python, R, Julia, and other widespread data science languages. There is also a lot of information on this in the Cross Validated (CV) community.
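For example, here is a minimal sketch in Python with scikit-learn (an assumed choice of library; glmnet plays the same role in R). The file name `emails.csv` and the `spam` column name are hypothetical placeholders for your actual data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline

# Hypothetical file: one column per word count plus a binary "spam" column.
df = pd.read_csv("emails.csv")
X = df.drop(columns="spam")
y = df["spam"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# LASSO (L1) penalty: shrinks many coefficients to exactly zero, so the
# penalty doubles as variable selection. LogisticRegressionCV chooses the
# penalty strength C by cross-validation.
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(
        Cs=10, penalty="l1", solver="liblinear", cv=5, max_iter=5000
    ),
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

Switching to `penalty="elasticnet"` with `solver="saga"` (plus an `l1_ratios` grid) gives the elastic-net variant, which tends to behave better when many word counts are correlated.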
CodePudding user response:
I would start by analysing each variable individually: fit a logistic regression on each one, and keep only those whose p-value is clearly significant.
After this first step, you can run a more complex logistic regression model that includes the variables remaining from the first step.
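A rough sketch of that two-step screening with statsmodels, assuming `X` is a pandas DataFrame of word counts and `y` the binary spam label (the 0.01 cutoff is an illustrative choice, not a fixed rule):

```python
import statsmodels.api as sm

significant = []
for col in X.columns:
    try:
        # Step 1: one-variable logistic regression, spam ~ word count.
        single = sm.Logit(y, sm.add_constant(X[[col]])).fit(disp=0)
    except Exception:
        continue  # skip words that cause fitting problems (e.g. perfect separation)
    if single.pvalues[col] < 0.01:
        significant.append(col)

# Step 2: one joint model on the variables that survived the screening.
joint = sm.Logit(y, sm.add_constant(X[significant])).fit(disp=0)
print(joint.summary())
```

With roughly 2800 separate tests, a stricter cutoff or a multiple-testing correction is worth considering before the second step.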