I have a data frame like so:
Input:
year ip type
2020 101 Missing
2021 101 Type 1
2022 101 Type 2
2020 102 Missing
2021 102 Missing
2020 103 Missing
2021 103 Type 2
2021 104 Type 1
2022 104 Type 2
2022 104 Type 2
How can I convert my data frame to the following:
Expected Output:
ip type
101 Missing/Type 1/Type 2
102 Missing
103 Missing/Type 2
104 Type 1/Type 2
I have the following solution, but it's not completely correct/what i want.
out = df.drop_duplicates(['ip','type']).groupby('ip')['type'].agg('/'.join).reset_index()
Out[638]:
ip type
0 101 Missing/Type1/Type2
1 102 Missing
2 103 Missing/Type2
3 104 Type1/Type2
For example the only accepted values for type i want are:
- Missing
- Type 1
- Type 2
- Missing/Type 1/Type 2
- Missing/Type 1
- Missing/Type 2
- Type 1/Type 2
With this solution, some instances will have Type 1/Missing or Type 2/Type 1 which I do not want. I would want to alter Type 1/Missing to Missing/Type 1 and alter Type 2/Type 1 to Type 1/Type 2.
I hope this makes sense. I can clarify if not.
CodePudding user response:
We can do pd.Categorical
and get the dataframe sort, notice here I am using the unique
with agg
order = pd.Categorical(df.type,['Missing','Type1','Type2']).argsort()
out = df.iloc[order].groupby('ip',as_index=False)[['type']].agg(lambda x : '/'.join(x.unique()))
out
Out[563]:
ip type
0 101 Missing/Type1/Type2
1 102 Missing
2 103 Missing/Type2
3 104 Type1/Type2
CodePudding user response:
You can use value_counts to determine the unique value pairs. Then sort those into ip-type order. Leverages the fact that the type order you want happens to be in alpha order.
Extended DF that introduces out-of-order entries:
year ip type
0 2020 101 Missing
1 2021 101 Type 1
2 2022 101 Type 2
3 2020 102 Missing
4 2021 102 Missing
5 2020 103 Missing
6 2021 103 Type 2
7 2021 104 Type 1
8 2022 104 Type 2
9 2022 104 Type 2
10 2021 105 Type 2
11 2022 105 Type 1
12 2022 105 Missing
13 2021 106 Type 2
14 2022 106 Type 1
df[['ip','type']].value_counts().sort_index().reset_index()[['ip','type']] \
.groupby('ip').agg('/'.join).reset_index()
ip type
0 101 Missing/Type 1/Type 2
1 102 Missing
2 103 Missing/Type 2
3 104 Type 1/Type 2
4 105 Missing/Type 1/Type 2
5 106 Type 1/Type 2
This should handle any order of ip-type combinations.