What I want to do is create a bag of words for 11410 strings and then append, after the word columns, a result column that I have stored in another dataframe. I have a dataframe with a 'result' column that I am trying to append as a new column to my existing bag-of-words dataframe; however, the appended column ends up full of NaN values.
My bag-of-words dataframe is 11410 x 111, and I want to add the 'result' column at the end. My code is as follows:
bow = vectorizer.fit_transform(df_train['text'])  # build the bag-of-words sparse matrix
bow_df = pd.DataFrame(bow.toarray(), columns=vectorizer.get_feature_names_out())  # turn the result into a dataframe
res = df_train['result']  # column of the dataframe that I want to insert
bow_df = bow_df.join(res)  # this SHOULD (but doesn't) do what I want
So I do end up with an 11410 x 112 dataframe, but the last column is full of NaNs.
My res structure:
226115 POS
191228 NEU
198033 NEG
100300 NEU
208472 POS
...
119879 POS
103694 NEU
131932 NEU
146867 NEU
121958 NEU
My bow_df structure:
age ages also amp apollo approval approved arm astrazeneca aug ... \
0 0 0 0 0 0 0 0 0 0 0 ...
1 0 0 0 0 0 0 0 0 0 0 ...
2 0 0 0 0 0 0 0 0 0 0 ...
3 0 0 0 0 0 0 0 0 0 0 ...
4 0 0 0 0 0 0 1 0 0 0 ...
... .. ... ... .. ... ... ... .. ... .. ...
11405 0 0 0 0 0 1 0 0 0 0 ...
11406 0 0 0 0 0 0 0 0 0 0 ...
11407 0 0 0 0 0 0 0 0 0 0 ...
11408 1 0 0 0 0 0 0 0 0 1 ...
11409 1 0 0 0 0 0 0 0 0 0 ...
urban us use vaccinated vaccination vaccine vaccines world would year
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 1 0 0 0
4 0 0 0 0 0 1 0 0 0 0
... ... .. .. ... ... ... ... ... ... ...
11405 0 0 1 0 0 0 0 0 0 0
11406 0 0 0 0 0 0 0 0 0 0
11407 0 0 0 0 0 0 0 0 0 0
11408 0 0 0 0 0 0 0 0 0 0
11409 0 0 0 0 0 0 0 0 0 0
I even tried bow_df = bow_df.astype(str) in case it was a type issue, but that didn't work either.
Thanks everyone.
CodePudding user response:
join joins index-on-index unless told otherwise (via the on kwarg). res keeps the index of df_train, whose labels are not in range(11410), so no labels match and every joined value is NaN. Reset the index before joining:
res.reset_index(drop=True, inplace=True)
or build res from df_train with the index already reset:
res = df_train['result'].reset_index(drop=True)
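A minimal sketch of the problem and the fix, using toy data in place of the real bag-of-words matrix (the column names and index labels here are made up for illustration):

```python
import pandas as pd

# Toy bag-of-words frame with a default RangeIndex (0, 1, 2)
bow_df = pd.DataFrame({"vaccine": [0, 1, 0], "world": [1, 0, 0]})

# A 'result' series that kept the original training frame's index labels
res = pd.Series(["POS", "NEU", "NEG"], index=[226115, 191228, 198033], name="result")

# Index-on-index join: no labels match, so the new column is all NaN
broken = bow_df.join(res)
assert broken["result"].isna().all()

# Resetting the index makes the labels 0..n-1, so alignment becomes positional
fixed = bow_df.join(res.reset_index(drop=True))
print(fixed["result"].tolist())  # ['POS', 'NEU', 'NEG']
```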
CodePudding user response:
It is because the indexes are not aligned. Try bow_df['result'] = res.values
to strip the index from the right-hand side, so the assignment is purely positional.
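A small sketch of this approach, again with made-up toy data:

```python
import pandas as pd

bow_df = pd.DataFrame({"vaccine": [0, 1], "world": [1, 0]})
res = pd.Series(["POS", "NEU"], index=[226115, 191228], name="result")

# .values returns a bare NumPy array with no index, so pandas assigns
# the values by position instead of trying to align index labels
bow_df["result"] = res.values
print(bow_df["result"].tolist())  # ['POS', 'NEU']
```

Note this only works if both sides have the same length and the rows are already in the same order, which holds here since bow_df was built row-for-row from df_train.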