Updated. Instead of
dict
data, I change for adataframe
as input
I'm analyzing a DataFrame with approximately 10,000 rows and 2 columns.
The criteria of my analysis is based on whether certain words appear in a certain cell.
I believe I will be more successful if I know which words are most relevant in terms of values...
Foo data to be used as an example:
data = { 'product': ['Dell Notebook I7', 'Dell Notebook I3', 'Logitech mx keys', 'Logitech mx 2'],
'cost': [1000,1200,300,100]}
df_data = pd.DataFrame(data)
product | cost | |
---|---|---|
0 | Dell Notebook I7 | 1000 |
1 | Dell Notebook I3 | 1200 |
2 | Logitech mx keys | 300 |
3 | Logitech mx 2 | 100 |
Basically, the column product
shoes the product an description.
In the column cost
shows the product cost.
What I want:
I would like to create another dataframe like this:
Desired Output:
unique_words | total_cost_for_unique_word | |
---|---|---|
1 | Dell | 2200 |
4 | Logitech | 2200 |
5 | Notebook | 2200 |
2 | I3 | 1200 |
3 | I7 | 1000 |
7 | mx | 400 |
6 | keys | 300 |
0 | 2 | 100 |
- Column
unique_words
with the list of each word that appears in the columnproduct
. - Column
total_cost_for_unique_word
with the sum of the values of products that contain that word.
I've tried searching for posts here from StackOverflow... Also, I've done google research, but I haven't found a solution. Maybe I still don't have the knowledge to find the answer.
If by any chance it has already been answered, please let me know and I will delete the post.
Thank you all.
CodePudding user response:
You can split
, explode
, groupby.agg
:
df = pd.DataFrame(data)
(df
.assign(unique_words=df['product'].str.split())
.explode('unique_words')
.groupby('unique_words', as_index=False)
.agg(**{'total cost': ('cost' ,'sum')})
.sort_values('total cost', ascending=False, ignore_index=True)
)
Output:
unique_words total cost
0 Dell 2200
1 Notebook 2200
2 I3 1200
3 I7 1000
4 Logitech 400
5 mx 400
6 keys 300
7 2 100
CodePudding user response:
If you first split the product into a list of all words (default is " "):
df["product"] = df["product"].str.split()
You can then explode this (for each item in the list as a new line), group all these together and sum the costs, then sorting and renaming columns to suit your outcome:
df.explode("product").groupby("product",as_index=False).agg("sum").sort_values("cost", ascending=False).rename(columns={"product": "unique_words", "cost", "total_cost_for_unique_word"})