Find shapes of dataframes inside lambda functions

I have the following DataFrame, where each row holds a pair of strings (which I treat as a tuple):

import pandas as pd

d = {'value': [['Red', 'Blue'],
               ['Blue', 'Yellow'],
               ['Blue', 'Yellow'],
               ['Yellow', 'Orange'],
               ['Green', 'Purple'],
               ['Purple', 'Yellow'],
               ['Yellow', 'Red'],
               ['Violet', 'Blue'],
               ['Blue', 'Green'],
               ['Green', 'Red'],
               ['Red', 'Brown'],
               ['Blue', 'Green']]}

df = pd.DataFrame(data = d)

For each row, I want to compute a probability based on the number of rows that hold the same values:

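def find_prob(df, tup):
    # Number of rows whose pair matches tup exactly
    d = df[df['value'].apply(lambda x: x[0] == tup[0] and x[1] == tup[1])].shape[0]
    # Number of rows whose first value matches tup[0]
    p = df[df['value'].apply(lambda x: x[0] == tup[0])].shape[0]

    return d / p

df['probs'] = df['value'].apply(lambda x: find_prob(df, x))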

I know it's dumb to pass the DataFrame into the apply function, so I want to know if there's a way to improve this logic.

Desired output is:

P.S. For each row, I want to divide the number of rows that match the whole tuple by the number of rows that start with the tuple's first value.
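To make the arithmetic concrete, here is a minimal sketch in plain Python, without pandas. It assumes the pairs are rewritten as tuples so they can be counted with collections.Counter; the names pairs, pair_counts, first_counts and probs are only illustrative.

```python
from collections import Counter

# Same pairs as in the DataFrame above, written as tuples so they are hashable.
pairs = [('Red', 'Blue'), ('Blue', 'Yellow'), ('Blue', 'Yellow'),
         ('Yellow', 'Orange'), ('Green', 'Purple'), ('Purple', 'Yellow'),
         ('Yellow', 'Red'), ('Violet', 'Blue'), ('Blue', 'Green'),
         ('Green', 'Red'), ('Red', 'Brown'), ('Blue', 'Green')]

# How many rows equal the whole pair.
pair_counts = Counter(pairs)
# How many rows start with the same first value.
first_counts = Counter(first for first, _ in pairs)

# One probability per row: matching pairs / rows with the same first value.
probs = [pair_counts[p] / first_counts[p[0]] for p in pairs]

# ('Blue', 'Yellow') appears twice and four rows start with 'Blue'.
print(probs[1])  # 0.5
```

For example, the row ['Purple', 'Yellow'] is the only row that starts with 'Purple', so its value is 1 / 1 = 1.0.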

CodePudding user response:

You can use groupby().transform('size') to count the rows for each full pair and for each first value:

tuple_counts = df.groupby(df['value'].apply(tuple))['value'].transform('size')
first_counts = df.groupby(df['value'].str[0])['value'].transform('size')
df['prob'] = tuple_counts/first_counts

Output:

               value  prob
0        [Red, Blue]   0.5
1     [Blue, Yellow]   0.5
2     [Blue, Yellow]   0.5
3   [Yellow, Orange]   0.5
4    [Green, Purple]   0.5
5   [Purple, Yellow]   1.0
6      [Yellow, Red]   0.5
7     [Violet, Blue]   1.0
8      [Blue, Green]   0.5
9       [Green, Red]   0.5
10      [Red, Brown]   0.5
11     [Blue, Green]   0.5
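As a possible variation, the pairs could first be split into two ordinary string columns, so that the grouping works on plain columns rather than on list objects. This is only a sketch under that assumption: the helper frame parts and its column names first and second are hypothetical, and the final check simply recomputes the question's row-by-row definition to compare the results.

```python
import pandas as pd

df = pd.DataFrame({'value': [['Red', 'Blue'], ['Blue', 'Yellow'], ['Blue', 'Yellow'],
                             ['Yellow', 'Orange'], ['Green', 'Purple'], ['Purple', 'Yellow'],
                             ['Yellow', 'Red'], ['Violet', 'Blue'], ['Blue', 'Green'],
                             ['Green', 'Red'], ['Red', 'Brown'], ['Blue', 'Green']]})

# Split each pair into two plain string columns (hypothetical names).
parts = pd.DataFrame(df['value'].tolist(), columns=['first', 'second'], index=df.index)

# Per-row group sizes: rows sharing both values, and rows sharing the first value.
pair_counts = parts.groupby(['first', 'second'])['first'].transform('size')
first_counts = parts.groupby('first')['first'].transform('size')

df['prob'] = pair_counts / first_counts

# Cross-check against the row-by-row definition from the question.
def find_prob(pair):
    same_pair = parts[(parts['first'] == pair[0]) & (parts['second'] == pair[1])].shape[0]
    same_first = parts[parts['first'] == pair[0]].shape[0]
    return same_pair / same_first

expected = df['value'].apply(find_prob)
assert (df['prob'] == expected).all()
print(df)
```

Both versions rely on transform('size') returning one count per original row, which is why the two Series can be divided directly without a merge.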