sklearn.feature_selection.chi2 returns list of NaN values


I have the following dataset (I am showing only a 4-row sample; the real one has 15,000 rows):

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.corpus import stopwords
from sklearn.feature_selection import chi2

quotes=["Sip N Shop Come thru right now Marjais PopularNobodies MMR Marjais SipNShop", 
        "I do not know about you but My family and I will not take the Covid19 vaccine anytime soon",
        "MSignorile Immunizations should be mandatory Period In Oklahoma they will not let kids go to school without them It is dangerous otherwise",
        "President Obama spoke in favor of vaccination for children Fox will start telling its viewers to choose against vaccination in 321"]

labels=[0,1,2,0]
dummy = pd.DataFrame({"quote": quotes, "label":labels})

I want to apply the well-known chi-squared test to eliminate irrelevant words per category (0, 1, 2), where 0 is neutral, 1 is positive, and 2 is negative.

Below is my approach (similar to the approach implemented here).
Briefly, I create a list of 0's equal in length to the corpus; the 0's represent the first label, y = 0. For the second label (1 = positive) I create a list of 1's, and similarly for the third label (2 = negative).

After applying this three times (once per target label), I will have three lists of the most dependent words per label. This final list will be my new vocabulary for the TF-IDF vectorizer.

def tweeter_tokenizer(tweet):
    return tweet.split(' ')

# NLTK's stop word list; run nltk.download("stopwords") once if it is missing.
english_stopwords = stopwords.words("english")

vectorizer = TfidfVectorizer(tokenizer=tweeter_tokenizer, ngram_range=(1,2), stop_words=english_stopwords)
vectorizer.fit(dummy["quote"])
X_train = vectorizer.transform(dummy["quote"])
y_train = dummy["label"]

feature_names = vectorizer.get_feature_names_out()
y_neutral     = np.array([0]*X_train.shape[0])  # label 0 for every document
pValue        = 0.90

chi_neutral, p_neutral   = chi2(X_train, y_neutral)

chi_neutral

The chi_neutral object is:

array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])

At the end I want to create a dataframe with one row per unique token (feature_names) for each label, and keep only the words with score > pValue. The dataframe will show me how many of the corpus tokens are dependent on class 0 (neutral). The same approach will be followed for the remaining labels (1: positive, 2: negative).

y_df = np.array([0]*X_train.shape[1])  # the label this table describes; length equals the number of feature names
tokens_neutral_dependent = pd.DataFrame({
    "tweet_token": feature_names,
    "chi2_score" : 1-p_neutral,
    "neutral_label": y_df
})
tokens_neutral_dependent = tokens_neutral_dependent.sort_values(["neutral_label","chi2_score"], ascending=[True,False])
tokens_neutral_dependent = tokens_neutral_dependent[tokens_neutral_dependent["chi2_score"]>pValue]
tokens_neutral_dependent.shape

CodePudding user response:

I don't think it's really meaningful to compute the chi-squared statistic without the real classes attached. The call chi2(X_train, y_neutral) asks: assuming that class and feature are independent, what are the odds of getting this distribution? But every sample you pass it carries the same class. With only one class there are zero degrees of freedom, the expected and observed counts degenerate, and the statistic is undefined, which is why every entry comes back NaN.
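
A minimal sketch of that degeneracy (toy counts and illustrative names, not your data): with a constant y there is only one class, so chi2 returns NaN for every feature, while real labels give finite values.

import numpy as np
from sklearn.feature_selection import chi2

# chi2 expects non-negative feature values (counts or tf-idf weights).
X = np.array([[1, 0, 2],
              [0, 1, 1],
              [1, 1, 0],
              [2, 0, 1]])

chi_const, p_const = chi2(X, np.zeros(4))   # one class: all NaN (NumPy may warn)
chi_real,  p_real  = chi2(X, [0, 1, 2, 0])  # three classes: finite statistics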

I would suggest this instead:

chi_neutral, p_neutral   = chi2(X_train, y_train)

If you're interested in chi-square statistics between particular classes, you can filter the dataset first to just two classes, then run the chi-squared test. But this step is not necessary.
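
As a hedged sketch on the variables above (the mask and variable names are mine), this compares class 0 against class 1 only, plus the one-vs-rest variant your description suggests:

# Pairwise: keep only the rows labeled 0 or 1, then test on that subset.
mask = y_train.isin([0, 1]).to_numpy()
chi_01, p_01 = chi2(X_train[mask], y_train[mask])

# One-vs-rest variant: does each token depend on "neutral vs. everything else"?
y_is_neutral = (y_train == 0).astype(int)
chi_ovr, p_ovr = chi2(X_train, y_is_neutral)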
