I have this data frame, df, that has boolean values:
A B C
0 0 1 0
1 0 1 1
2 0 1 1
3 1 0 1
4 0 0 0
5 1 0 0
6 0 0 0
7 0 0 1
8 1 0 0
9 0 0 0
10 1 0 1
11 1 0 1
12 0 1 1
13 1 0 0
14 1 0 0
15 0 1 0
16 1 1 0
17 0 0 1
18 1 0 1
19 1 0 0
20 1 0 1
21 1 1 0
22 1 1 1
23 1 1 1
24 1 0 0
25 1 1 0
26 0 0 1
27 0 1 1
28 0 1 0
29 1 1 0
30 1 0 1
31 0 1 0
32 0 0 1
33 1 1 1
34 0 1 0
35 1 1 0
36 0 1 0
37 0 0 1
38 0 1 1
39 0 1 1
I stored the count of rows as follows:
N = len(df.index) # 40 in this case
Using groupby, I counted each unique combination of values in df as follows:
count_series = df.groupby(["A", "B", "C"]).size() #all columns
new_df = count_series.to_frame(name = 'count').reset_index()
print(new_df)
The new_df looks like this:
A B C count
0 0 0 0 3
1 0 0 1 5
2 0 1 0 6
3 0 1 1 6
4 1 0 0 6
5 1 0 1 6
6 1 1 0 5
7 1 1 1 3
Now df has N = 40 rows, and I want to create a new dataframe, dfD, that has the same columns as df plus an additional column named P(A,B,C) holding the probability of each combination. For example, any row with the values 0, 0, 0 should get count/N (3/40), which is 0.075.
I found these posts, but none of them helped because they handle specific cases, and my df won't always have just 3 columns (A, B, C) or just 40 rows; it might be bigger than that: link1 link2
I want something that works with a dataframe of any size.
CodePudding user response:
Convert each row into a tuple and use groupby:
# one tuple per row, usable as a group key
grp = df.apply(tuple, axis=1)

pd.concat([df.groupby(grp).first(),                                 # one row per unique combination
           grp.groupby(grp).count().div(len(df)).rename("Probs")],  # relative frequency of each combination
          axis=1).reset_index(drop=True)
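Note that this yields one row per unique combination. If the probability should instead be attached to every original row (the dfD from the question), a minimal sketch reusing the same grp key could look like this (the column name P(A,B,C) is taken from the question):

# map each row's tuple key onto the relative frequency of that tuple
dfD = df.assign(**{"P(A,B,C)": grp.map(grp.value_counts(normalize=True))})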
CodePudding user response:
We can use value_counts with the parameter normalize=True to calculate the probabilities, then merge the resulting probabilities with the original dataframe:
c = ['A', 'B', 'C']  # for an arbitrary dataframe, use list(df.columns)
df.merge(df[c].value_counts(normalize=True).reset_index(name='prob'))
A B C prob
0 0 1 0 0.150
1 0 1 0 0.150
2 0 1 0 0.150
3 0 1 0 0.150
...
18 0 0 0 0.075
...
38 1 1 1 0.075
39 1 1 1 0.075
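DataFrame.value_counts requires pandas 1.1.0 or newer. Note also that the output above is ordered by combination rather than by the original index. A minimal sketch of an equivalent for older pandas versions that also keeps the original row order, assuming the same c as above:

probs = df.groupby(c).size().div(len(df)).rename('prob').reset_index()
df.merge(probs, how='left')  # a left merge preserves the left frame's row order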
CodePudding user response:
Continuing your own approach ...
import numpy as np

# subset of columns you want to use (everything except 'count')
columns = new_df.columns[:-1]

# use np.add.reduce to concatenate the stringified values into one key per row
new_df['key'] = np.add.reduce(new_df[columns].astype(str), axis=1)
new_df['prob'] = new_df['count'] / N

# build the same key in df
df['key'] = np.add.reduce(df.astype(str), axis=1)

# build a key -> probability mapping
mapping = new_df[['key', 'prob']].set_index('key').to_dict()['prob']

# map the probabilities onto the original rows
df['P(ABC)'] = df['key'].map(mapping)
df.drop('key', axis=1, inplace=True)
df[columns] = df[columns].astype(int)  # or astype(bool)
df
A B C P(ABC)
0 0 1 0 0.150
1 0 1 1 0.150
2 0 1 1 0.150
3 1 0 1 0.150
4 0 0 0 0.075
5 1 0 0 0.150
6 0 0 0 0.075
7 0 0 1 0.125
8 1 0 0 0.150
9 0 0 0 0.075
10 1 0 1 0.150
11 1 0 1 0.150
12 0 1 1 0.150
13 1 0 0 0.150
14 1 0 0 0.150
15 0 1 0 0.150
16 1 1 0 0.125
17 0 0 1 0.125
18 1 0 1 0.150
19 1 0 0 0.150
20 1 0 1 0.150
21 1 1 0 0.125
22 1 1 1 0.075
23 1 1 1 0.075
24 1 0 0 0.150
25 1 1 0 0.125
26 0 0 1 0.125
27 0 1 1 0.150
28 0 1 0 0.150
29 1 1 0 0.125
30 1 0 1 0.150
31 0 1 0 0.150
32 0 0 1 0.125
33 1 1 1 0.075
34 0 1 0 0.150
35 1 1 0 0.125
36 0 1 0 0.150
37 0 0 1 0.125
38 0 1 1 0.150
39 0 1 1 0.150
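One caveat for the general case: building the key by concatenating stringified values without a separator can become ambiguous once a column holds multi-character values (e.g. 1 followed by 11 and 11 followed by 1 both produce '111'). With the 0/1 data here that cannot happen, but a minimal sketch that avoids the intermediate key entirely, assuming the combination columns are known, is:

cols = ['A', 'B', 'C']  # or any subset of df's columns
# each row's group size divided by the total row count gives its probability
df['P(ABC)'] = df.groupby(cols)[cols[0]].transform('size') / len(df)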