I have the below pandas data frame.
d = {'id1': ['85643', '85644','85643','8564312','8564314','85645','8564316','85646','8564318','85647','85648','85649','85655'],'ID': ['G-00001', 'G-00001','G-00002','G-00002','G-00002','G-00001','G-00002','G-00001','G-00002','G-00001','G-00001','G-00001','G-00001'],'col1': [671, 2,5,3,4,5,60,0,0,6,3,2,4],'Goal': [np.nan, 56,78,np.nan,89,73,np.nan ,np.nan ,np.nan, np.nan, np.nan, 34,np.nan ], 'col2': [793, 4,8,32,43,55,610,0,0,16,23,72,48],'col3': [500, 22,89,33,44,55,60,1,5,6,3,2,4],'Name': ['aasd', 'aasd','aabsd','aabsd','aabsd','aasd','aabsd','aasd','aabsd','aasd','aasd','aasd','aasd'],'Date': ['2021-06-13', '2021-06-13','2021-06-14','2021-06-13','2021-06-14','2021-06-15','2021-06-15','2021-06-13','2021-06-16','2021-06-13','2021-06-13','2021-06-13','2021-06-16']}
dff = pd.DataFrame(data=d)
dff
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0000 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0000 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0000 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0000 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0000 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
I want to summarize some of the columns and add them back to the same data frame based on some ids in the "id1" column. Then I want to give a new name to the "ID" column when we add that row.
Based on below "id1" column ids I want to summarize only "col1","col2","col3",and "Name" columns. Then I want to add that row back to the same data frame and give a new id for the "ID" column. However, I want to do that by the "ID" column for my calculations. (Like groupby)
In the below function I'm aggregating the sum for col1,col2, and col3 with average.
ID_list = ['G-00001','G-00002']
def sumarizeValues(Filter,Orginal):
b65 = ['85643','85645', '85655','85646'] # for G-00001
b66 = ['85643','85645','85647','85648','85649','85644'] # for G-00001
b67 = ['85643','8564312','8564314','8564316','8564318'] # for G-00002
# create a dictionary
d_map = {'b65': b65, 'b66': b66, 'b67': b67}
# dictionary comprehension
df = pd.DataFrame({k: dff[dff['id1'].isin(v)].agg({'col1': sum, 'col2': sum,
'col3': 'mean', 'Name': min})
for k,v in d_map.items()}).T.reset_index()
# rename the columns
df = df.rename(columns={'index': 'ID'})
# concat the two frames
#pd.concat([dff, df]).reset_index(drop=True)
Orginal = pd.concat([Orginal, df]).reset_index(drop=True)
return Orginal
I only want to create a summarized row if the ID has values in slices. for example in the ID_list first I'm taking 'G-00001' and creating summarized rows based on id1 slicers(b65,b66,b67). However, the function I created giving me some additional rows like below. How can I eliminate those rows?
So final data frame look like below
ID_list = ['G-00001','G-00002']
def abcFunction(dff):
for ID in ID_list:
print(ID)
IDlist =[ID]
print(IDlist)
Filter = dff[dff['ID'].isin(IDlist)]
dff = sumarizeValues(Filter,dff)
print(dff)
## calculation
## calculation
## calculation
abcFunction(dff)
# So for the first ID (G-00001), I actually don't need the last row(15th index containing b67).
# I only need that row for the G-00002 calcualtion.
G-00001
['G-00001']
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0000 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0000 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0000 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0000 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0000 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
13 NaN b65 685 NaN 904 129.8000 aabsd NaN
14 NaN b66 694 NaN 971 96.7143 aabsd NaN
15 NaN b67 743 NaN 1486 121.8333 aabsd NaN
# When I ran it for G-00002, it actually contains all the other rows created.
#So for the second ID (G-00002), I actually don't need the row(13th index to 17th index
#containing b65,b66, and b67 in index 15). Because G-00002 doesn't
#contain any values in b65,b66.
G-00002
['G-00002']
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0000 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0000 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0000 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0000 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0000 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
13 NaN b65 685 NaN 904 129.8000 aabsd NaN
14 NaN b66 694 NaN 971 96.7143 aabsd NaN
15 NaN b67 743 NaN 1486 121.8333 aabsd NaN
16 NaN b65 685 NaN 904 129.8000 aabsd NaN
17 NaN b66 694 NaN 971 96.7143 aabsd NaN
18 NaN b67 743 NaN 1486 121.8333 aabsd NaN
Is it possible to do that? Any help is appreciated! Thanks in advance!
CodePudding user response:
Change your mapping dictionary so each ID is mapped to the required new IDs. Try:
def summarizeValues(df, ID):
mapper = {"G-00001": {"b65": ['85643', '85645', '85655','85646'],
"b66": ['85643', '85645', '85647', '85648', '85649', '85644']},
"G-00002": {"b67": ['85643', '8564312', '8564314', '8564316', '8564318']}
}
# dictionary comprehension
dff = pd.DataFrame({k: df[df['id1'].isin(v)].agg({'col1': sum,
'col2': sum,
'col3': 'mean',
'Name': min})
for k,v in mapper[ID].items()})
dff = dff.T.reset_index().rename(columns={'index': 'ID'})
return pd.concat([df, dff]).reset_index(drop=True)
output = dict()
for ID in ['G-00001','G-00002']:
output[ID] = summarizeValues(df, ID)
Output:
>>> output["G-00001"]
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
13 NaN b65 685 NaN 904 129.8 aabsd NaN
14 NaN b66 694 NaN 971 96.714286 aabsd NaN
>>> output["G-00002"]
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
13 NaN b67 743 NaN 1486 121.833333 aabsd NaN