I am unable to properly explain my requirement, but I can show the expected result. I have a dataframe that looks like so:
Series1 | Series2 |
---|---|
1370307 | 1370306 |
927092 | 927091 |
925392 | 925391 |
925390 | 925389 |
2344089 | 2344088 |
1827855 | 1827854 |
1715793 | 1715792 |
2356467 | 2356466 |
1463264 | 1463263 |
1712684 | 1712683 |
actual dataframe size: 902811 rows × 2 columns
Then I build a second dataframe of the unique Series2 values together with their counts, using value_counts:

```python
df2 = df['Series2'].value_counts().rename_axis('Series2').to_frame('counts').reset_index()
```
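For reference, here is how that value_counts step behaves on a small made-up sample (not the real 902811-row data) with the same two-column structure:

```python
import pandas as pd

# Hypothetical sample: each Series1 value pairs with a Series2 key that can repeat.
df = pd.DataFrame({
    'Series1': [2543114, 2547568, 2432033, 2444169, 2490928],
    'Series2': [2543113, 2543113, 2432032, 2432032, 2432032],
})

# value_counts returns the unique Series2 keys sorted by frequency, descending.
df2 = df['Series2'].value_counts().rename_axis('Series2').to_frame('counts').reset_index()
print(df2)
#    Series2  counts
# 0  2432032       3
# 1  2543113       2
```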
Next, I need the list of matching Series1 values for each Series2 value. The expected result is:
Series2 | counts | Series1_List |
---|---|---|
2543113 | 6 | [2543114, 2547568, 2559207, 2563778, 2564330, 2675803] |
2557212 | 6 | [2557213, 2557301, 2559192, 2576080, 2675693, 2712790] |
2432032 | 5 | [2432033, 2444169, 2490928, 2491392, 2528056] |
2559269 | 5 | [2559270, 2576222, 2588034, 2677710, 2713207] |
2439554 | 5 | [2439555, 2441882, 2442272, 2443590, 2443983] |
2335180 | 5 | [2335181, 2398282, 2527060, 2527321, 2565487] |
2494111 | 4 | [2494112, 2495321, 2526026, 2528492] |
2559195 | 4 | [2559196, 2570172, 2634537, 2675718] |
2408775 | 4 | [2408776, 2409117, 2563765, 2564320] |
2408773 | 4 | [2408774, 2409116, 2563764, 2564319] |
I achieve this (although only for a subset of 50 rows) using the following code:

```python
df2.loc[:50, 'Series1_List'] = df2.loc[:50, 'Series2'].apply(lambda x: df[df['Series2']==x]['Series1'].tolist())
```
If I run this on the whole dataframe, it doesn't complete even after 20 minutes. Is there a faster, more efficient way to achieve the same result?
CodePudding user response:
IIUC, use:

```python
df2 = (df.groupby('Series2', as_index=False)
         .agg(counts=('Series1', 'count'), Series1_List=('Series1', list))
      )
```
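A minimal sketch of the groupby on a made-up sample (not the asker's data). Note that groupby sorts by the Series2 key; to reproduce the descending-count order that value_counts gave, add a sort_values on counts:

```python
import pandas as pd

# Hypothetical sample with repeating Series2 keys.
df = pd.DataFrame({
    'Series1': [2543114, 2547568, 2432033, 2444169, 2490928],
    'Series2': [2543113, 2543113, 2432032, 2432032, 2432032],
})

# One groupby pass builds both the count and the list of Series1 values,
# avoiding the per-row boolean filter of the apply approach.
df2 = (df.groupby('Series2', as_index=False)
         .agg(counts=('Series1', 'count'), Series1_List=('Series1', list))
         .sort_values('counts', ascending=False, ignore_index=True))
print(df2)
#    Series2  counts                 Series1_List
# 0  2432032       3  [2432033, 2444169, 2490928]
# 1  2543113       2           [2543114, 2547568]
```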