Issue while inserting count vectorizer results to the dataframe

I have a dataframe with shape (4237, 19) and another dataframe with shape (4237, 6). I need to combine these two dataframes column-wise, so the resulting dataframe should have shape (4237, 25), but instead I get (5524, 25). I can't understand what is causing this.

The code I used:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

social_media_vectorizer = CountVectorizer(lowercase=True)

train_social_media_vector = social_media_vectorizer.fit_transform(x_train["social_media"].values.astype("U"))
test_social_media_vector = social_media_vectorizer.transform(x_test["social_media"].values.astype('U'))

print(x_train.shape)
print(x_test.shape)

train_social_media_df = pd.DataFrame(train_social_media_vector.todense(), columns=social_media_vectorizer.get_feature_names_out())
test_social_media_df = pd.DataFrame(test_social_media_vector.todense(), columns=social_media_vectorizer.get_feature_names_out())
x_train = pd.concat([x_train, train_social_media_df], axis=1)
x_test = pd.concat([x_test, test_social_media_df], axis=1)

print("="*100)
print(x_train.shape)
print(x_test.shape)

print("="*100)
print(social_media_vectorizer.vocabulary_)

Result

(4237, 19)
(1816, 19)
====================================================================================================
(5524, 25)
(3058, 25)
====================================================================================================
{'facebook': 0, 'linkedin': 2, 'twitter': 4, 'instagram': 1, 'youtube': 5, 'producthunt': 3}

CodePudding user response:

Are you sure the shape of train_social_media_vector.todense() is (4237, 6)? It seems to be (1287, 6).

Try ignore_index=True:

x_train = pd.concat([x_train, train_social_media_df], axis=1, ignore_index=True)
x_test = pd.concat([x_test, test_social_media_df], axis=1, ignore_index=True)

CodePudding user response:

Check the indexes of x_train and x_test before the concat. I assume they differ from the indexes of the vectorized dataframes. pd.concat aligns rows by index, and rows that are missing on one side are filled with NaNs by default, which is why extra rows appear. If you do not care about the indexes at all, simply drop them with .reset_index(drop=True) before the concat, or ignore them by passing ignore_index=True to pd.concat(). See @Corralien's answer above. A small demonstration follows below.
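For illustration, a minimal self-contained sketch (the frame names left and right are placeholders, not from the question) showing how mismatched indexes inflate the row count and how reset_index(drop=True) fixes it:

import pandas as pd

# Frame whose index survived a shuffled train/test split (non-default labels).
left = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 4, 7])
# Frame built fresh (e.g. from a vectorizer output), which gets a default 0..n-1 index.
right = pd.DataFrame({"b": [4, 5, 6]})

# concat(axis=1) aligns rows on the index, so mismatched labels
# produce the union of both indexes, padded with NaNs.
print(pd.concat([left, right], axis=1).shape)  # (6, 2)

# Dropping the old index makes both frames share 0..n-1, so rows line up.
left = left.reset_index(drop=True)
print(pd.concat([left, right], axis=1).shape)  # (3, 2)

Applied to the question's code, calling .reset_index(drop=True) on x_train and x_test before the two pd.concat calls should give the expected (4237, 25) and (1816, 25) shapes.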
