In pandas data frame how to remove some summarized duplicates rows


I have the pandas data frame below.

import numpy as np
import pandas as pd

d = {'id1': ['85643', '85644', '85643', '8564312', '8564314', '85645', '8564316', '85646', '8564318', '85647', '85648', '85649', '85655'],
     'ID': ['G-00001', 'G-00001', 'G-00002', 'G-00002', 'G-00002', 'G-00001', 'G-00002', 'G-00001', 'G-00002', 'G-00001', 'G-00001', 'G-00001', 'G-00001'],
     'col1': [671, 2, 5, 3, 4, 5, 60, 0, 0, 6, 3, 2, 4],
     'Goal': [np.nan, 56, 78, np.nan, 89, 73, np.nan, np.nan, np.nan, np.nan, np.nan, 34, np.nan],
     'col2': [793, 4, 8, 32, 43, 55, 610, 0, 0, 16, 23, 72, 48],
     'col3': [500, 22, 89, 33, 44, 55, 60, 1, 5, 6, 3, 2, 4],
     'Name': ['aasd', 'aasd', 'aabsd', 'aabsd', 'aabsd', 'aasd', 'aabsd', 'aasd', 'aabsd', 'aasd', 'aasd', 'aasd', 'aasd'],
     'Date': ['2021-06-13', '2021-06-13', '2021-06-14', '2021-06-13', '2021-06-14', '2021-06-15', '2021-06-15', '2021-06-13', '2021-06-16', '2021-06-13', '2021-06-13', '2021-06-13', '2021-06-16']}

dff = pd.DataFrame(data=d)
dff
    id1         ID      col1    Goal    col2   col3  Name   Date
0   85643   G-00001     671     NaN     793    500  aasd    2021-06-13
1   85644   G-00001     2       56.0000 4      22   aasd    2021-06-13
2   85643   G-00002     5       78.0000 8      89   aabsd   2021-06-14
3   8564312 G-00002     3       NaN     32     33   aabsd   2021-06-13
4   8564314 G-00002     4       89.0000 43     44   aabsd   2021-06-14
5   85645   G-00001     5       73.0000 55     55   aasd    2021-06-15
6   8564316 G-00002     60      NaN     610    60   aabsd   2021-06-15
7   85646   G-00001     0       NaN     0      1    aasd    2021-06-13
8   8564318 G-00002     0       NaN     0      5    aabsd   2021-06-16
9   85647   G-00001     6       NaN     16     6    aasd    2021-06-13
10  85648   G-00001     3       NaN     23     3    aasd    2021-06-13
11  85649   G-00001     2       34.0000 72     2    aasd    2021-06-13
12  85655   G-00001     4       NaN     48     4    aasd    2021-06-16

I want to summarize some of the columns and append the results back to the same data frame as new rows, based on groups of ids from the "id1" column, giving each new row a new name in the "ID" column.

Based on the "id1" slices below, I want to summarize only the "col1", "col2", "col3", and "Name" columns, append the resulting row to the data frame, and give it a new id in the "ID" column. However, I want to drive the calculation by the "ID" column (like a groupby).

In the function below I aggregate col1 and col2 with sum, and col3 with mean.

ID_list = ['G-00001','G-00002']
def sumarizeValues(Filter,Orginal):
    b65 = ['85643','85645', '85655','85646'] # for G-00001
    b66 = ['85643','85645','85647','85648','85649','85644'] # for G-00001
    b67 = ['85643','8564312','8564314','8564316','8564318'] # for G-00002

    # create a dictionary
    d_map = {'b65': b65, 'b66': b66, 'b67': b67}
    # dictionary comprehension
    df = pd.DataFrame({k: dff[dff['id1'].isin(v)].agg({'col1': sum, 'col2': sum,
                                                   'col3': 'mean', 'Name': min})
                       for k,v in d_map.items()}).T.reset_index()
    # rename the columns
    df = df.rename(columns={'index': 'ID'})
    # concat the two frames
    #pd.concat([dff, df]).reset_index(drop=True)
    Orginal = pd.concat([Orginal, df]).reset_index(drop=True)
    return Orginal
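The per-column aggregation used above can be sketched in isolation: passing a dict to `DataFrame.agg` applies a different function to each column. A minimal sketch on a toy frame (values made up for illustration):

```python
import pandas as pd

# Toy frame (hypothetical values) illustrating a per-column aggregation dict:
# sum for col1/col2, mean for col3 -- the same pattern used in the function above.
toy = pd.DataFrame({'col1': [1, 2, 3],
                    'col2': [10, 20, 30],
                    'col3': [4.0, 5.0, 6.0]})
summary = toy.agg({'col1': 'sum', 'col2': 'sum', 'col3': 'mean'})
print(summary['col1'], summary['col3'])  # 6 5.0
```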

I only want to create a summarized row if the ID has values in the slices. For example, for the first entry in ID_list, 'G-00001', I create summarized rows based on the id1 slices (b65, b66, b67). However, the function I created gives me some additional rows, shown below. How can I eliminate those rows?

My driver function and the output it produces look like this:

ID_list = ['G-00001','G-00002']
def abcFunction(dff):
    for ID in ID_list:
        print(ID)
        IDlist =[ID]
        print(IDlist)
        Filter =  dff[dff['ID'].isin(IDlist)]
        dff = sumarizeValues(Filter,dff)
        print(dff)
        ## calculation
        ## calculation
        ## calculation

abcFunction(dff)
# So for the first ID (G-00001), I actually don't need the last row (index 15, containing b67).
# I only need that row for the G-00002 calculation.

G-00001
['G-00001']
        id1       ID col1    Goal  col2     col3   Name        Date
0     85643  G-00001  671     NaN   793      500   aasd  2021-06-13
1     85644  G-00001    2 56.0000     4       22   aasd  2021-06-13
2     85643  G-00002    5 78.0000     8       89  aabsd  2021-06-14
3   8564312  G-00002    3     NaN    32       33  aabsd  2021-06-13
4   8564314  G-00002    4 89.0000    43       44  aabsd  2021-06-14
5     85645  G-00001    5 73.0000    55       55   aasd  2021-06-15
6   8564316  G-00002   60     NaN   610       60  aabsd  2021-06-15
7     85646  G-00001    0     NaN     0        1   aasd  2021-06-13
8   8564318  G-00002    0     NaN     0        5  aabsd  2021-06-16
9     85647  G-00001    6     NaN    16        6   aasd  2021-06-13
10    85648  G-00001    3     NaN    23        3   aasd  2021-06-13
11    85649  G-00001    2 34.0000    72        2   aasd  2021-06-13
12    85655  G-00001    4     NaN    48        4   aasd  2021-06-16
13      NaN      b65  685     NaN   904 129.8000  aabsd         NaN
14      NaN      b66  694     NaN   971  96.7143  aabsd         NaN
15      NaN      b67  743     NaN  1486 121.8333  aabsd         NaN


# When I run it for G-00002, the frame contains all the other rows created so far.
# So for the second ID (G-00002), I don't need the rows at indexes 13 to 17
# (containing b65, b66, and b67 at index 15), because G-00002 doesn't
# contain any values in b65 or b66.

G-00002
['G-00002']
        id1       ID col1    Goal  col2     col3   Name        Date
0     85643  G-00001  671     NaN   793      500   aasd  2021-06-13
1     85644  G-00001    2 56.0000     4       22   aasd  2021-06-13
2     85643  G-00002    5 78.0000     8       89  aabsd  2021-06-14
3   8564312  G-00002    3     NaN    32       33  aabsd  2021-06-13
4   8564314  G-00002    4 89.0000    43       44  aabsd  2021-06-14
5     85645  G-00001    5 73.0000    55       55   aasd  2021-06-15
6   8564316  G-00002   60     NaN   610       60  aabsd  2021-06-15
7     85646  G-00001    0     NaN     0        1   aasd  2021-06-13
8   8564318  G-00002    0     NaN     0        5  aabsd  2021-06-16
9     85647  G-00001    6     NaN    16        6   aasd  2021-06-13
10    85648  G-00001    3     NaN    23        3   aasd  2021-06-13
11    85649  G-00001    2 34.0000    72        2   aasd  2021-06-13
12    85655  G-00001    4     NaN    48        4   aasd  2021-06-16
13      NaN      b65  685     NaN   904 129.8000  aabsd         NaN
14      NaN      b66  694     NaN   971  96.7143  aabsd         NaN
15      NaN      b67  743     NaN  1486 121.8333  aabsd         NaN
16      NaN      b65  685     NaN   904 129.8000  aabsd         NaN
17      NaN      b66  694     NaN   971  96.7143  aabsd         NaN
18      NaN      b67  743     NaN  1486 121.8333  aabsd         NaN

Is it possible to do that? Any help is appreciated! Thanks in advance!
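(For what it's worth, the literal duplicates in the G-00002 run could be stripped after the fact with `drop_duplicates`, though that treats the symptom only and would not remove the unwanted b67 row from the G-00001 run. A minimal sketch, assuming the summary rows are identified by their 'ID' label:)

```python
import numpy as np
import pandas as pd

# Hypothetical frame reproducing the duplicated summary rows (b65/b66/b67 twice).
df = pd.DataFrame({'id1': [np.nan] * 6,
                   'ID': ['b65', 'b66', 'b67', 'b65', 'b66', 'b67'],
                   'col1': [685, 694, 743, 685, 694, 743]})
# Keep only the first occurrence of each summary label.
deduped = df.drop_duplicates(subset=['ID'], keep='first').reset_index(drop=True)
print(list(deduped['ID']))  # ['b65', 'b66', 'b67']
```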

CodePudding user response:

Change your mapping dictionary so each ID is mapped to the required new IDs. Try:

def summarizeValues(df, ID):
    mapper = {"G-00001": {"b65": ['85643', '85645', '85655','85646'],
                         "b66": ['85643', '85645', '85647', '85648', '85649', '85644']},
              "G-00002": {"b67": ['85643', '8564312', '8564314', '8564316', '8564318']}
              }
    
    # dictionary comprehension
    dff = pd.DataFrame({k: df[df['id1'].isin(v)].agg({'col1': sum, 
                                                      'col2': sum, 
                                                      'col3': 'mean', 
                                                      'Name': min})
                        for k,v in mapper[ID].items()})
    
    dff = dff.T.reset_index().rename(columns={'index': 'ID'})
    
    return pd.concat([df, dff]).reset_index(drop=True)

output = dict()
for ID in ['G-00001', 'G-00002']:
    output[ID] = summarizeValues(dff, ID)
Output:
>>> output["G-00001"]
        id1       ID col1  Goal col2       col3   Name        Date
0     85643  G-00001  671   NaN  793        500   aasd  2021-06-13
1     85644  G-00001    2  56.0    4         22   aasd  2021-06-13
2     85643  G-00002    5  78.0    8         89  aabsd  2021-06-14
3   8564312  G-00002    3   NaN   32         33  aabsd  2021-06-13
4   8564314  G-00002    4  89.0   43         44  aabsd  2021-06-14
5     85645  G-00001    5  73.0   55         55   aasd  2021-06-15
6   8564316  G-00002   60   NaN  610         60  aabsd  2021-06-15
7     85646  G-00001    0   NaN    0          1   aasd  2021-06-13
8   8564318  G-00002    0   NaN    0          5  aabsd  2021-06-16
9     85647  G-00001    6   NaN   16          6   aasd  2021-06-13
10    85648  G-00001    3   NaN   23          3   aasd  2021-06-13
11    85649  G-00001    2  34.0   72          2   aasd  2021-06-13
12    85655  G-00001    4   NaN   48          4   aasd  2021-06-16
13      NaN      b65  685   NaN  904      129.8  aabsd         NaN
14      NaN      b66  694   NaN  971  96.714286  aabsd         NaN

>>> output["G-00002"]
        id1       ID col1  Goal  col2        col3   Name        Date
0     85643  G-00001  671   NaN   793         500   aasd  2021-06-13
1     85644  G-00001    2  56.0     4          22   aasd  2021-06-13
2     85643  G-00002    5  78.0     8          89  aabsd  2021-06-14
3   8564312  G-00002    3   NaN    32          33  aabsd  2021-06-13
4   8564314  G-00002    4  89.0    43          44  aabsd  2021-06-14
5     85645  G-00001    5  73.0    55          55   aasd  2021-06-15
6   8564316  G-00002   60   NaN   610          60  aabsd  2021-06-15
7     85646  G-00001    0   NaN     0           1   aasd  2021-06-13
8   8564318  G-00002    0   NaN     0           5  aabsd  2021-06-16
9     85647  G-00001    6   NaN    16           6   aasd  2021-06-13
10    85648  G-00001    3   NaN    23           3   aasd  2021-06-13
11    85649  G-00001    2  34.0    72           2   aasd  2021-06-13
12    85655  G-00001    4   NaN    48           4   aasd  2021-06-16
13      NaN      b67  743   NaN  1486  121.833333  aabsd         NaN
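
Note that the appended summary rows are the only ones with a missing `id1`, so they can be split back out from the originals at any later step. A small sketch on toy data (not the full frame):

```python
import numpy as np
import pandas as pd

# Toy frame: two original rows plus one appended summary row (id1 is NaN).
df = pd.DataFrame({'id1': ['85643', '85645', np.nan],
                   'ID': ['G-00001', 'G-00001', 'b65'],
                   'col1': [671, 5, 685]})
summary_rows = df[df['id1'].isna()]    # the appended aggregate rows
original_rows = df[df['id1'].notna()]  # the untouched source rows
print(len(summary_rows), len(original_rows))  # 1 2
```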