Home > Back-end >  Pandas group by and keep ID
Pandas group by and keep ID

Time:10-02

I have a data frame like this:

temp_id administ_div municipality combine
1 Gunma Maebashi
NaN Gunma Maebashi CC
NaN Gunma Maebashi DD
NaN Gunma Maebashi EE
2 Gunma Kiryu AA

How do I group by administ_div, municipality to this (temp_id must be not NaN):

temp_id administ_div municipality combine
1 Gunma Maebashi [CC,DD,EE]
2 Gunma Kiryu AA

CodePudding user response:

You can use a custom groupby.agg with help of ffill:

# aggregate all columns with first value
agg_funcs = {c: 'first' for c in df}
# except "combine" as list of non-null values
agg_funcs['combine'] = lambda l: [x for x in l if x]

# group by filled values and aggregate
out = (df
 .groupby(df['temp_id'].ffill(), as_index=False)
 .agg(agg_funcs)
)

output:

   temp_id administ_div municipality       combine
0      1.0        Gunma     Maebashi  [CC, DD, EE]
1      2.0        Gunma        Kiryu          [AA]

CodePudding user response:

try:

df['temp_id'] = df['temp_id'].ffill()
df1 = df.groupby(['temp_id', 'administ_div', 'municipality'])['combine'].apply(lambda x: [i for i in x if not pd.isna(i)][0] if len(x)==1 else [i for i in x if not pd.isna(i)]).reset_index()

df1
    temp_id administ_div    municipality    combine
0   1.0     Gunma           Maebashi        [CC, DD, EE]
1   2.0     Gunma           Kiryu           AA

CodePudding user response:

A solution with pandas.DataFrame.groupby and pandas.Series.dropna combined :

new_df = (
            df.groupby(['administ_div', 'municipality'], as_index=False)
              .agg(lambda x: list(x.dropna()) if len(x)>1 else x)
              .assign(temp_id = lambda y: y['temp_id'].astype(str).str.strip('[]').astype(float).astype(int))
              .reindex(columns=df.columns)
              .sort_values(by='temp_id')
         )

# Output :

print(new_df) 

   temp_id administ_div municipality       combine
1        1        Gunma     Maebashi  [CC, DD, EE]
0        2        Gunma        Kiryu            AA

CodePudding user response:

out = (df.assign(temp_id=df.temp_id.ffill())
         .groupby(['temp_id', 'administ_div', 'municipality'], as_index=False)['combine']
         .agg(lambda x: x.dropna().tolist()))
print(out)

Output:

   temp_id administ_div municipality       combine
0      1.0        Gunma     Maebashi  [CC, DD, EE]
1      2.0        Gunma        Kiryu          [AA]
  • Related