Home > Back-end >  Group by Aggregate column - Making sure they follow a specific format
Group by Aggregate column - Making sure they follow a specific format

Time:06-09

I have a data frame like so:

Input:

year ip   type
2020 101  Missing
2021 101  Type 1
2022 101  Type 2
2020 102  Missing
2021 102  Missing
2020 103  Missing
2021 103  Type 2
2021 104  Type 1
2022 104  Type 2
2022 104  Type 2

How can I convert my data frame to the following:

Expected Output:

ip  type
101 Missing/Type 1/Type 2
102 Missing
103 Missing/Type 2
104 Type 1/Type 2

I have the following solution, but it's not completely correct/what i want.

    out = df.drop_duplicates(['ip','type']).groupby('ip')['type'].agg('/'.join).reset_index()
    Out[638]: 
        ip                 type
    0  101  Missing/Type1/Type2
    1  102              Missing
    2  103        Missing/Type2
    3  104          Type1/Type2

For example the only accepted values for type i want are:

  1. Missing
  2. Type 1
  3. Type 2
  4. Missing/Type 1/Type 2
  5. Missing/Type 1
  6. Missing/Type 2
  7. Type 1/Type 2

With this solution, some instances will have Type 1/Missing or Type 2/Type 1 which I do not want. I would want to alter Type 1/Missing to Missing/Type 1 and alter Type 2/Type 1 to Type 1/Type 2.

I hope this makes sense. I can clarify if not.

CodePudding user response:

We can do pd.Categorical and get the dataframe sort, notice here I am using the unique with agg

order = pd.Categorical(df.type,['Missing','Type1','Type2']).argsort()
out = df.iloc[order].groupby('ip',as_index=False)[['type']].agg(lambda x : '/'.join(x.unique()))
out
Out[563]: 
    ip                 type
0  101  Missing/Type1/Type2
1  102              Missing
2  103        Missing/Type2
3  104          Type1/Type2

CodePudding user response:

You can use value_counts to determine the unique value pairs. Then sort those into ip-type order. Leverages the fact that the type order you want happens to be in alpha order.

Extended DF that introduces out-of-order entries: 
    year   ip     type
0   2020  101  Missing
1   2021  101   Type 1
2   2022  101   Type 2
3   2020  102  Missing
4   2021  102  Missing
5   2020  103  Missing
6   2021  103   Type 2
7   2021  104   Type 1
8   2022  104   Type 2
9   2022  104   Type 2
10  2021  105   Type 2
11  2022  105   Type 1
12  2022  105  Missing
13  2021  106   Type 2
14  2022  106   Type 1

df[['ip','type']].value_counts().sort_index().reset_index()[['ip','type']] \
    .groupby('ip').agg('/'.join).reset_index()  

    ip                   type
0  101  Missing/Type 1/Type 2
1  102                Missing
2  103         Missing/Type 2
3  104          Type 1/Type 2
4  105  Missing/Type 1/Type 2
5  106          Type 1/Type 2

This should handle any order of ip-type combinations.

  • Related