Home > OS >  Pandas - Sum for each unique word
Pandas - Sum for each unique word

Time:08-14

Updated. Instead of dict data, I change for a dataframe as input

I'm analyzing a DataFrame with approximately 10,000 rows and 2 columns.

The criteria of my analysis is based on whether certain words appear in a certain cell.

I believe I will be more successful if I know which words are most relevant in terms of values...

Foo data to be used as an example:
data = { 'product': ['Dell Notebook I7', 'Dell Notebook I3', 'Logitech mx keys', 'Logitech mx 2'],
         'cost': [1000,1200,300,100]}

df_data = pd.DataFrame(data)
product cost
0 Dell Notebook I7 1000
1 Dell Notebook I3 1200
2 Logitech mx keys 300
3 Logitech mx 2 100

Basically, the column product shoes the product an description. In the column cost shows the product cost.

What I want:

I would like to create another dataframe like this:

Desired Output:
unique_words total_cost_for_unique_word
1 Dell 2200
4 Logitech 2200
5 Notebook 2200
2 I3 1200
3 I7 1000
7 mx 400
6 keys 300
0 2 100
  • Column unique_words with the list of each word that appears in the column product.
  • Column total_cost_for_unique_word with the sum of the values of products that contain that word.

I've tried searching for posts here from StackOverflow... Also, I've done google research, but I haven't found a solution. Maybe I still don't have the knowledge to find the answer.

If by any chance it has already been answered, please let me know and I will delete the post.

Thank you all.

CodePudding user response:

You can split, explode, groupby.agg:

df = pd.DataFrame(data)
(df
 .assign(unique_words=df['product'].str.split())
 .explode('unique_words')
 .groupby('unique_words', as_index=False)
 .agg(**{'total cost': ('cost' ,'sum')})
 .sort_values('total cost', ascending=False, ignore_index=True)
 )

Output:

  unique_words  total cost
0         Dell        2200
1     Notebook        2200
2           I3        1200
3           I7        1000
4     Logitech         400
5           mx         400
6         keys         300
7            2         100

CodePudding user response:

If you first split the product into a list of all words (default is " "):

df["product"] = df["product"].str.split()

You can then explode this (for each item in the list as a new line), group all these together and sum the costs, then sorting and renaming columns to suit your outcome:

df.explode("product").groupby("product",as_index=False).agg("sum").sort_values("cost", ascending=False).rename(columns={"product": "unique_words", "cost", "total_cost_for_unique_word"})
  • Related