Adding a new column to a dataframe based on values of columns and values from another dataframe


I have this dataframe, df, which has boolean values:

    A  B  C
0   0  1  0
1   0  1  1
2   0  1  1
3   1  0  1
4   0  0  0
5   1  0  0
6   0  0  0
7   0  0  1
8   1  0  0
9   0  0  0
10  1  0  1
11  1  0  1
12  0  1  1
13  1  0  0
14  1  0  0
15  0  1  0
16  1  1  0
17  0  0  1
18  1  0  1
19  1  0  0
20  1  0  1
21  1  1  0
22  1  1  1
23  1  1  1
24  1  0  0
25  1  1  0
26  0  0  1
27  0  1  1
28  0  1  0
29  1  1  0
30  1  0  1
31  0  1  0
32  0  0  1
33  1  1  1
34  0  1  0
35  1  1  0
36  0  1  0
37  0  0  1
38  0  1  1
39  0  1  1

I stored the row count as follows:

N = len(df.index) # 40 in this case

Using groupby, I counted the occurrences of each combination in df as follows:

    count_series = df.groupby(["A", "B", "C"]).size() #all columns
    new_df = count_series.to_frame(name = 'count').reset_index()
    print(new_df)

The resulting new_df looks like this:

   A  B  C  count
0  0  0  0     3
1  0  0  1     5
2  0  1  0     6
3  0  1  1     6
4  1  0  0     6
5  1  0  1     6
6  1  1  0     5
7  1  1  1     3

Now the row count of df is N = 40, and I want to create a new dataframe, dfD, that has the same columns as df plus an additional column named P(A,B,C) holding the probability of each combination. For example, any row with the values 0,0,0 should get count/N (3/40), which is 0.075. I found these posts, but none of them helped because they handle specific cases: link1 link2. My df won't always have just 3 columns (A, B, C) or just 40 rows; it might be bigger than that. I want something that works with a dataframe of any size.

CodePudding user response:

Convert each row into a tuple and use groupby:

import pandas as pd

# group identical rows by turning each row into a tuple of its values
grp = df.apply(tuple, axis=1)
pd.concat([df.groupby(grp).first(),
           grp.groupby(grp).count().div(len(df)).rename("Probs")],
          axis=1).reset_index(drop=True)
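If you need the probability attached to every original row of df (as the question asks) rather than one row per combination, a small variation of the same idea could work; this is a sketch reusing the same grp series, with Probs as an illustrative column name:

# transform('count') keeps the original row order and length
df_probs = df.assign(Probs=grp.groupby(grp).transform('count').div(len(df)))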

CodePudding user response:

We can use value_counts with the parameter normalize=True to calculate the probabilities, then merge the resulting probabilities with the original dataframe:

c = ['A', 'B', 'C']
df.merge(df[c].value_counts(normalize=True).reset_index(name='prob'))

    A  B  C   prob
0   0  1  0  0.150
1   0  1  0  0.150
2   0  1  0  0.150
3   0  1  0  0.150
...
18  0  0  0  0.075
...
38  1  1  1  0.075
39  1  1  1  0.075
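Because the probabilities are computed from a plain list of column names, the same pattern scales to frames with more columns or rows. For instance, a sketch that uses every column of df as the key and names the result P(A,B,C) to match the question:

# use all columns of df as the grouping key instead of a hard-coded list
probs = df.value_counts(normalize=True).rename('P(A,B,C)').reset_index()
dfD = df.merge(probs)  # merges on all shared columns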

CodePudding user response:

Continuing your own approach ...

import numpy as np

# subset of columns you want to use
columns = new_df.columns[:-1]
# use np.add.reduce to concatenate with astype(str)
new_df['key'] = np.add.reduce(new_df[columns].astype(str), axis=1)
new_df['prob'] = new_df['count'] / N

# do the same in df
df['key'] = np.add.reduce(df.astype(str), axis=1)

# maintain a mapping
mapping = new_df[['key', 'prob']].set_index('key').to_dict()['prob']

# map it
df['P(ABC)'] = df['key'].map(mapping)
df.drop('key', axis=1, inplace=True)
df[columns] = df[columns].astype(int)  # or astype(bool)
df

    A   B   C   P(ABC)
0   0   1   0   0.150
1   0   1   1   0.150
2   0   1   1   0.150
3   1   0   1   0.150
4   0   0   0   0.075
5   1   0   0   0.150
6   0   0   0   0.075
7   0   0   1   0.125
8   1   0   0   0.150
9   0   0   0   0.075
10  1   0   1   0.150
11  1   0   1   0.150
12  0   1   1   0.150
13  1   0   0   0.150
14  1   0   0   0.150
15  0   1   0   0.150
16  1   1   0   0.125
17  0   0   1   0.125
18  1   0   1   0.150
19  1   0   0   0.150
20  1   0   1   0.150
21  1   1   0   0.125
22  1   1   1   0.075
23  1   1   1   0.075
24  1   0   0   0.150
25  1   1   0   0.125
26  0   0   1   0.125
27  0   1   1   0.150
28  0   1   0   0.150
29  1   1   0   0.125
30  1   0   1   0.150
31  0   1   0   0.150
32  0   0   1   0.125
33  1   1   1   0.075
34  0   1   0   0.150
35  1   1   0   0.125
36  0   1   0   0.150
37  0   0   1   0.125
38  0   1   1   0.150
39  0   1   1   0.150
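One design note on the string key (an observation, not part of the original answer): plain concatenation can collide when values are longer than one character, e.g. ('1', '11') and ('11', '1') both become '111'. Joining with a separator avoids that; a sketch using the same columns variable:

# join the stringified values with a separator so multi-character values cannot collide
new_df['key'] = new_df[columns].astype(str).agg('|'.join, axis=1)
df['key'] = df[columns].astype(str).agg('|'.join, axis=1)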