I'm using a "ColumnTransformer" even though I'm transforming only one feature, because I don't know how else to transform only the "clean_text" feature. I'm not using "make_column_transformer" with "make_column_selector" because I would like to use a grid search later. However, I don't understand why column 0 of the dataset can't be found.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
#dataset download: https://www.kaggle.com/saurabhshahane/twitter-sentiment-dataset
df = pd.read_csv('Twitter_Data.csv')
y = df['category'] #target
X = df['clean_text'].values.astype('U') #feature; I cast "X" to string, even though in theory it already was one, because otherwise it raised an error
transformers = [
['text_vectorizer', CountVectorizer(), [0]],
]
ct = ColumnTransformer(transformers, remainder='passthrough')
ct.fit(X) #<---IndexError: tuple index out of range
X = ct.transform(X)
CodePudding user response:
Imo there are a couple of points to be highlighted on this example:
CountVectorizer requires its input to be 1D. For such cases, the documentation for ColumnTransformer states that
columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.
Therefore, the columns parameter should be passed as an int rather than as a list of int. For another reference, I would also suggest the question "Sklearn custom transformers with pipeline: all the input array dimensions for the concatenation axis must match exactly".
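To make the scalar-versus-list point concrete, here is a minimal sketch on a made-up three-row DataFrame (the name `toy` and its contents are assumptions for illustration, not the Kaggle data). It shows that a scalar column index hands CountVectorizer a 1D column, while a one-column 2D DataFrame is iterated over its column labels, so the vectorizer sees a single "document".

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for Twitter_Data.csv
toy = pd.DataFrame({
    "clean_text": ["good day", "bad day", "ok"],
    "category": [1.0, -1.0, 0.0],
})

# Scalar column index: the transformer receives a 1D column (one document per row)
ct = ColumnTransformer(
    [("text_vectorizer", CountVectorizer(), 0)],
    remainder="passthrough",
)
X_scalar = ct.fit_transform(toy)
n_rows_scalar = X_scalar.shape[0]  # one output row per input row

# 2D input, which is what a list such as [0] would select:
# iterating over a DataFrame yields its column labels, not its rows,
# so the vectorizer only sees the string "clean_text"
X_2d = CountVectorizer().fit_transform(toy[["clean_text"]])
n_rows_2d = X_2d.shape[0]

print(n_rows_scalar, n_rows_2d)
```

The row-count mismatch in the second case is what later breaks the concatenation with the passthrough columns.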
Given that you're using a column transformer, I would pass the whole dataframe to the .fit() method called on the ColumnTransformer instance, rather than X only.
The dataframe seems to have missing values; it might be convenient to process them somehow. For instance, by dropping them and applying what is described above I was able to make it work, but you can also decide to proceed differently.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
#dataset download: https://www.kaggle.com/saurabhshahane/twitter-sentiment-dataset
df = pd.read_csv('Twitter_Data.csv')
y = df['category']
X = df['clean_text']
df.info()
df_n = df.dropna()
transformers = [
('text_vectorizer', CountVectorizer(), 0)
]
ct = ColumnTransformer(transformers, remainder='passthrough')
ct.fit(df_n)
ct.transform(df_n)
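As one possible way to "proceed differently" with the missing values, the sketch below compares dropping rows with filling missing text with an empty string. It runs on a made-up DataFrame (`toy`, `make_ct`, `dropped` and `filled` are hypothetical names, and the data is not the Kaggle file), so treat it as an illustration rather than a recommendation for this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data with one missing tweet
toy = pd.DataFrame({
    "clean_text": ["good day", np.nan, "bad day"],
    "category": [1.0, 0.0, -1.0],
})

def make_ct():
    # Same transformer layout as in the answer: scalar column index 0
    return ColumnTransformer(
        [("text_vectorizer", CountVectorizer(), 0)],
        remainder="passthrough",
    )

# Option 1: drop rows whose text is missing
dropped = toy.dropna(subset=["clean_text"])
X_dropped = make_ct().fit_transform(dropped)

# Option 2: keep every row and treat missing text as an empty document
filled = toy.fillna({"clean_text": ""})
X_filled = make_ct().fit_transform(filled)

print(X_dropped.shape, X_filled.shape)
```

Filling keeps the row count aligned with the original target column, whereas dropping means the target has to be taken from the same filtered dataframe.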
As noted in the comments, transformers should be specified as a list of tuples (as per the documentation) rather than as a list of lists. However, running the snippet above with your transformers specification seems to work. I've also observed that substituting lists for tuples elsewhere (in unrelated pieces of code I have) doesn't seem to raise issues. Still, in my experience it is far more common to see them passed as a list of tuples.