I am working on a program that goes through tweets and predicts which of two categories the author falls into. I want to use get_dummies to flag whether or not a tweet contains any of the top 10 hashtags, plus an 'other' flag for any hashtag outside the top 10. (In the end I will probably be using the top 500 or so hashtags, not just 10; the data set is over 500,000 columns in total with over 50,000 unique hashtags.)
This is my first time using pandas, so apologies if my question is unclear. What I'm expecting is that each row in the data set gets a new column for each hashtag, and the value of each [row][column] pair is 1 if the row contains that hashtag or 0 if it does not. There would also be an 'other' column indicating that the row has hashtags not in the top 10.
I already know how to find the most frequently occurring hashtags in the column:
counts = df.hashtags.value_counts()
counts.nlargest(10)
I also understand how to get dummies; I just don't know how to restrict it so that it does not make a column for every hashtag.
dummies = pd.get_dummies(df, columns=['hashtags'])
Please let me know if I could be clearer or provide more info. Appreciate the help!
CodePudding user response:
I don't have time to generate data and work it all out, but I thought I'd give you this idea in case it helps.
The idea is to leverage .isin() to get the values that you need to build the dummies, then leverage the power of the index to match them back to the source rows.
Something like:
pd.get_dummies(df.loc[df['hashtags'].isin(counts.nlargest(10).index)], columns=['hashtags'])
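You will have to check whether the indices give you what you need: the result only contains the rows whose hashtag is in the top 10, so rows with any other hashtag are dropped rather than marked as 'other'.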
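To make the idea concrete, here is a minimal sketch on toy data. It assumes the 'hashtags' column holds a single hashtag string per row (as the value_counts() call in the question suggests); the toy frame, the tweet_id column, and the names top_n, bucketed, and aligned are made up for illustration. It shows two options: relabelling every non-top hashtag as 'other' before calling get_dummies, and the filter-then-realign route described above.

```python
import pandas as pd

# Toy data (hypothetical); assumes one hashtag string per row.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5, 6],
    "hashtags": ["a", "b", "a", "c", "a", "b"],
})

top_n = 2  # the question uses 10, and later around 500
top = df["hashtags"].value_counts().nlargest(top_n).index

# Option 1: keep the top hashtags and relabel everything else as 'other'.
# Note that missing values (tweets with no hashtag) also end up in 'other'.
bucketed = df["hashtags"].where(df["hashtags"].isin(top), "other")

# One 0/1 column per kept hashtag, plus 'hashtags_other'.
dummies = pd.get_dummies(bucketed, prefix="hashtags", dtype=int)

# dummies shares df's index, so a plain join lines the rows up.
result = df.join(dummies)

# Option 2: build dummies for the top rows only, then use the index to
# realign with the source rows; rows outside the top N become all zeros.
top_only = pd.get_dummies(
    df.loc[df["hashtags"].isin(top), "hashtags"], prefix="hashtags", dtype=int
)
aligned = top_only.reindex(df.index, fill_value=0)

print(result)
print(aligned)
```

Option 1 gives the explicit 'other' column the question asks for. Option 2 keeps only the top-N columns, and a row of all zeros then plays the role of 'other'.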