import pandas as pd
Create a dataframe for an example problem:
data = [[['A','B','C','D','E'],1000],
[['C','B'],2000],
[['X','Y','Z'],3000]]
df = pd.DataFrame(data=data, columns=['token','view'])
This is what it looks like:
In[0]:df
Out[0]:
             token  view
0  [A, B, C, D, E]  1000
1           [C, B]  2000
2        [X, Y, Z]  3000
Now the main problem: I need to sum the views for every unique value found in the lists in the column df['token']. For example, the letter "B" appears in two rows with views 1000 and 2000, so its total should be 3000. I wrote code to compute these sums, but it does not work correctly. Below is my code and the result I would like to get.
We create a list with all letters and convert it to a set and back to get only unique values without repetitions:
lst = []
for item in df['token']:
    for it in item:
        lst.append(it)
set_token = set(lst)
unique_token = list(set_token)
This is what it looks like:
In[1]:unique_token
Out[1]:
['D', 'E', 'Z', 'B', 'A', 'Y', 'C', 'X']
Now create a dictionary with keys from unique values and assign each 0 views as the initial value:
dict_token = {}
for item in unique_token:
    dict_token[item] = 0
This is what it looks like:
In[2]:dict_token
Out[2]:
{'D': 0, 'E': 0, 'Z': 0, 'B': 0, 'A': 0, 'Y': 0, 'C': 0, 'X': 0}
Next I wrote code to sum the views for each key of the dictionary whenever that key occurs in a row's list. But I can't get the value of df['view'] for the row that is currently being iterated over: I get a whole Series instead, and the count is wrong. How can I fix it? The result I would like to get is shown at the end.
for item in dict_token.keys():
    for it in df['token']:
        if item in it:
            dict_token[item] = df['view']
And I get:
In[3]:dict_token
Out[3]:
{'D': 0 2000
1 4000
2 6000
Name: view, dtype: int64,
'E': 0 2000
1 4000
2 6000
Name: view, dtype: int64,
'Z': 0 2000
1 4000
2 6000
Name: view, dtype: int64,
'B': 0 4000
1 8000
2 12000
Name: view, dtype: int64,
'A': 0 2000
1 4000
2 6000
Name: view, dtype: int64,
'Y': 0 2000
1 4000
2 6000
Name: view, dtype: int64,
'C': 0 4000
1 8000
2 12000
Name: view, dtype: int64,
'X': 0 2000
1 4000
2 6000
Name: view, dtype: int64}
But I want:
Out[3]:{'D': 1000, 'E': 1000, 'Z': 3000, 'B': 3000, 'A': 1000, 'Y': 3000, 'C': 3000, 'X': 3000}
CodePudding user response:
You can explode your lists, then compute the groupby.sum and convert the result with to_dict:
df.explode('token').groupby('token')['view'].sum().to_dict()
output: {'A': 1000, 'B': 3000, 'C': 3000, 'D': 1000, 'E': 1000, 'X': 3000, 'Y': 3000, 'Z': 3000}
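If you would rather keep an explicit loop, here is a minimal sketch of one way to fix it, assuming the same df as in the question. The idea is to iterate over df['token'] and df['view'] together with zip, so each list is paired with the view count of its own row instead of the whole column. The names totals and loop_result are only illustrative. Note that set(tokens) counts a token once per row, like the `if item in it` check in the question, whereas explode would count a token once per occurrence if it were repeated inside a single list; on this data both give the same result.

```python
import pandas as pd
from collections import defaultdict

df = pd.DataFrame(
    data=[[['A', 'B', 'C', 'D', 'E'], 1000],
          [['C', 'B'], 2000],
          [['X', 'Y', 'Z'], 3000]],
    columns=['token', 'view'],
)

# Walk the two columns in parallel so every token list is paired
# with the view count from the same row.
totals = defaultdict(int)
for tokens, view in zip(df['token'], df['view']):
    for tok in set(tokens):  # set(): a token repeated within one row counts once
        totals[tok] += view

# Convert to a plain dict of Python ints.
loop_result = {tok: int(total) for tok, total in totals.items()}

# The explode/groupby one-liner from the answer, for comparison.
explode_result = df.explode('token').groupby('token')['view'].sum().to_dict()

print(loop_result == explode_result)
```

The key order of loop_result is not guaranteed because sets are unordered, but dict comparison ignores order, so the two results can be compared directly.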