I have a dataframe in the following format:
df = pd.DataFrame({'column_with_tuples': [[('word1', 10), ('word2', 20), ('word3', 30)], [('word4', 40), ('word5', 50), ('word6', 60)]],
'category':['category1','category2']})
I want to move the tuples into two separate columns and preserve the category column to be able to easily filter the most common words for each category.
So the final result should look like this:
df_new = pd.DataFrame({'word': ['word1','word2', 'word3','word4','word5','word6'],
'frequency': [10, 20, 30, 40, 50, 60],
'category':['category1','category1', 'category1', 'category2', 'category2', 'category2']})
I tried with this code but the result is not the one I expect:
df_tuples = pd.concat([pd.DataFrame(x) for x in df['column_with_tuples']], ignore_index=True)
df_tuples.columns = ['word', 'frequency']
df.drop(['column_with_tuples'], axis=1, inplace=True)
df = pd.concat([df, df_tuples], axis=1)
I would appreciate some help here.
CodePudding user response:
You can initially explode column_with_tuples
into multiple rows and then build a multiindex from a series of tuples (word, freaquency)
with pd.MultiIndex.from_tuples
:
df2 = df.explode('column_with_tuples')
df2.set_index(pd.MultiIndex.from_tuples(df2['column_with_tuples']))\
.reset_index(names=['word', 'frequency']).drop(columns='column_with_tuples')
word frequency category
0 word1 10 category1
1 word2 20 category1
2 word3 30 category1
3 word4 40 category2
4 word5 50 category2
5 word6 60 category2
CodePudding user response:
Using df.explode()
method
df_new = df.explode("column_with_tuples")
df_new = df_new.rename(columns={"column_with_tuples": "word"})
df_new["word"], df_new["frequency"] = zip(*df_new["word"])
df_new = df_new.assign(category=df["category"])
df_new = df_new.reset_index(drop=True)
print(df_new)
word category frequency
0 word1 category1 10
1 word2 category1 20
2 word3 category1 30
3 word4 category2 40
4 word5 category2 50
5 word6 category2 60
CodePudding user response:
One option with the explode
method:
(df
.explode('column_with_tuples')
.assign(word = lambda df: df.column_with_tuples.str[0],
frequency = lambda df: df.column_with_tuples.str[1])
.drop(columns='column_with_tuples')
)
category word frequency
0 category1 word1 10
0 category1 word2 20
0 category1 word3 30
1 category2 word4 40
1 category2 word5 50
1 category2 word6 60
Another option, using vanilla python, before creating the final dataframe:
from itertools import product, chain
out = [product([cat], tuples)
for cat, tuples
in zip(df.category, df.column_with_tuples)]
out = chain.from_iterable(out)
out = [(cat, *tuples) for cat, tuples in out]
pd.DataFrame(out, columns = ['category', 'word', 'frequency'])
category word frequency
0 category1 word1 10
1 category1 word2 20
2 category1 word3 30
3 category2 word4 40
4 category2 word5 50
5 category2 word6 60