Home > database >  Expand column containing list of tuples into the current dataframe
Expand column containing list of tuples into the current dataframe

Time:01-19

I have a dataframe in the following format:

df = pd.DataFrame({'column_with_tuples': [[('word1', 10), ('word2', 20), ('word3', 30)], [('word4', 40), ('word5', 50), ('word6', 60)]],
                   'category':['category1','category2']})

I want to move the tuples into two separate columns and preserve the category column to be able to easily filter the most common words for each category.

So the final result should look like this:

df_new = pd.DataFrame({'word': ['word1','word2', 'word3','word4','word5','word6'],
                   'frequency': [10, 20, 30, 40, 50, 60],
                   'category':['category1','category1', 'category1', 'category2', 'category2', 'category2']})

I tried with this code but the result is not the one I expect:

df_tuples = pd.concat([pd.DataFrame(x) for x in df['column_with_tuples']], ignore_index=True)

df_tuples.columns = ['word', 'frequency']

df.drop(['column_with_tuples'], axis=1, inplace=True)

df = pd.concat([df, df_tuples], axis=1)

I would appreciate some help here.

CodePudding user response:

You can initially explode column_with_tuples into multiple rows and then build a multiindex from a series of tuples (word, freaquency) with pd.MultiIndex.from_tuples:

df2 = df.explode('column_with_tuples')
df2.set_index(pd.MultiIndex.from_tuples(df2['column_with_tuples']))\
    .reset_index(names=['word', 'frequency']).drop(columns='column_with_tuples')

   word  frequency   category
0  word1         10  category1
1  word2         20  category1
2  word3         30  category1
3  word4         40  category2
4  word5         50  category2
5  word6         60  category2

CodePudding user response:

Using df.explode() method

df_new = df.explode("column_with_tuples")
df_new = df_new.rename(columns={"column_with_tuples": "word"})

df_new["word"], df_new["frequency"] = zip(*df_new["word"])
df_new = df_new.assign(category=df["category"])
df_new = df_new.reset_index(drop=True)
print(df_new)

    word   category  frequency
0  word1  category1         10
1  word2  category1         20
2  word3  category1         30
3  word4  category2         40
4  word5  category2         50
5  word6  category2         60

CodePudding user response:

One option with the explode method:

(df
.explode('column_with_tuples')
.assign(word = lambda df: df.column_with_tuples.str[0], 
        frequency = lambda df: df.column_with_tuples.str[1])
.drop(columns='column_with_tuples')
) 
    category   word  frequency
0  category1  word1         10
0  category1  word2         20
0  category1  word3         30
1  category2  word4         40
1  category2  word5         50
1  category2  word6         60

Another option, using vanilla python, before creating the final dataframe:

from itertools import product, chain

out = [product([cat], tuples) 
       for cat, tuples 
       in zip(df.category, df.column_with_tuples)]

out = chain.from_iterable(out)

out = [(cat, *tuples) for cat, tuples in out]

pd.DataFrame(out, columns = ['category', 'word', 'frequency'])
    category   word  frequency
0  category1  word1         10
1  category1  word2         20
2  category1  word3         30
3  category2  word4         40
4  category2  word5         50
5  category2  word6         60
  • Related