I have the following dataframe with pairs of strings in tuples
d = {'value': [['Red', 'Blue'],
['Blue', 'Yellow'],
['Blue', 'Yellow'],
['Yellow', 'Orange'],
['Green', 'Purple'],
['Purple', 'Yellow'],
['Yellow', 'Red'],
['Violet', 'Blue'],
['Blue', 'Green'],
['Green', 'Red'],
['Red', 'Brown'],
['Blue', 'Green']]}
df = pd.DataFrame(data = d)
And I want to find for each row probability, which can be calculated based on number of rows with same values
def find_prob(df, tup):
d = df[df.new.apply(lambda x: x[0] == tup[0] and x[1] == tup[1])].shape[0]
p = df[df.new.apply(lambda x: x[0] == tup[0])].shape[0]
return d / p
df['probs'] = df.new.apply(lambda x: find_prob(df, x))
I know it's dumb to pass DataFrame in apply function so I want to known if there's a way to improve this logic
Desired output is:
P.S. I want to divide number of rows on number of rows, that start with first value of a tuple
CodePudding user response:
You can use groupby().transform('size')
to count each of the types:
tuple_counts = df.groupby(df['value'].apply(tuple))['value'].transform('size')
first_counts = df.groupby(df['value'].str[0])['value'].transform('size')
df['prob'] = tuple_counts/first_counts
Output:
value prob
0 [Red, Blue] 0.5
1 [Blue, Yellow] 0.5
2 [Blue, Yellow] 0.5
3 [Yellow, Orange] 0.5
4 [Green, Purple] 0.5
5 [Purple, Yellow] 1.0
6 [Yellow, Red] 0.5
7 [Violet, Blue] 1.0
8 [Blue, Green] 0.5
9 [Green, Red] 0.5
10 [Red, Brown] 0.5
11 [Blue, Green] 0.5