Home > Software engineering >  Count unique ids in pandas column based on a nested dictionary
Count unique ids in pandas column based on a nested dictionary

Time:12-09

I have the below pandas data frame.

d = {'id1': ['85643', '85644','85643','8564312','8564314','85645','8564316','85646','8564318','85647','85648','85649','85655','56731','34566','78931','78931'],'ID': ['G-00001', 'G-00001','G-00002','G-00002','G-00002','G-00001','vaasd','aasd','aasd','vaasd','aasd','vaasd','aasd','aasd2','aasd2','aasd2','aasd2'],'col1': [671, 2,5,3,4,5,60,0,0,6,3,2,4,32,3,1,23],'Goal': [np.nan, 56,78,np.nan,89,73,np.nan ,np.nan ,np.nan, np.nan, np.nan, 34,np.nan, 7, 84,np.nan,5 ], 'col2': [793, 4,8,32,43,55,610,0,0,16,23,72,48,3,28,5,3],'col3': [500, 22,89,33,44,55,60,1,5,6,3,2,4,13,12,14,98],'Date': ['2021-06-13', '2021-06-13','2021-06-14','2021-06-13','2021-06-14','2021-06-15','2021-06-15','2021-06-13','2021-06-16','2021-06-13','2021-06-13','2021-06-13','2021-06-16','2021-05-23','2021-05-13','2021-03-26','2021-05-13']}
dff = pd.DataFrame(data=d)
dff

    id1     ID      col1    Goal    col2    col3    Date
0   85643   G-00001 671     NaN         793 500 2021-06-13
1   85644   G-00001 2       56.0000     4   22  2021-06-13
2   85643   G-00002 5       78.0000     8   89  2021-06-14
3   8564312 G-00002 3       NaN 32      33      2021-06-13
4   8564314 G-00002 4       89.0000     43  44  2021-06-14
5   85645   G-00001 5       73.0000     55  55  2021-06-15
6   8564316 vaasd   60      NaN         610 60  2021-06-15
7   85646   aasd    0       NaN         0   1   2021-06-13
8   8564318 aasd    0       NaN         0   5   2021-06-16
9   85647   vaasd   6       NaN         16  6   2021-06-13
10  85648   aasd    3       NaN         23  3   2021-06-13
11  85649   vaasd   2       34.0000     72  2   2021-06-13
12  85655   aasd    4       NaN         48  4   2021-06-16
13  56731   aasd2   32      7.0000      3   13  2021-05-23
14  34566   aasd2   3       84.0000     28  12  2021-05-13
15  78931   aasd2   1       NaN         5   14  2021-03-26
16  78931   aasd2   23      5.0000      3   98  2021-05-13

Also, I have the below dictionary

dic = {'G-0001':{'aasd':['G-0001','85646','85648','345_2','85655','85659'],
'vaasd':['G-0001','85649','34554','85655','22183','45335','8564316']},
'G-0002':{'aasd2':['G-0002','85343','78931','45121','56731']},}

I want to get a unique count of id1, based on the list inside the dictionary. For example, if we consider this list 'aasd2':['G-0002','85343','78931','45121','56731']. I want to get how many unique id1's are in pandas' data frame. So it should be for aasd2- 2 values ('78931','56731'). Those are the only two values in pandas for aasd2.

So I want to create a table just like below for the list name and count of ids

listName    count of ids
aasd            3
vaasd           2
aasd2           2

Is it possible to do this in python? Any suggestion would be appreciated. Thanks in advance!!

CodePudding user response:

Create a dataframe from your dict and merge it with dff then use value_counts after drop_duplicates on ('id1', 'ID'):

data = []
for d in dic.values():
    for k, l in d.items():
        data.extend([(v, k) for v in l])
df1 = pd.DataFrame(data, columns=['id1', 'ID'])

out = dff.merge(df1, on=['id1', 'ID']) \
         .drop_duplicates(['id1', 'ID']) \
         .value_counts('ID')

Output:

>>> out
ID
aasd     3
aasd2    2
vaasd    2
dtype: int64

CodePudding user response:

here is another way :

data = []
for d in dic.values():
    for k, l in d.items():
        data.extend([(v, k) for v in l])

df = df[df['id1', 'ID']].apply(tuple, axis=1).isin(data)].groupby("ID")["id1"].nunique()

CodePudding user response:

One option is to run the entire (almost) process in Python before producing the final dataframe:

from collections import ChainMap

# get the dictionary for the inner dict
content = dict(ChainMap(*dic.values()))
uniq = dff.id1.unique()
content = [(key, len(set(val).intersection(uniq))) 
           for key, val in content.items()]
pd.DataFrame(content, columns = ['listName', 'count of Ids'])

  listName  count of Ids
0    aasd2             2
1     aasd             3
2    vaasd             3
  • Related