How to store two or three column combined unique data in other df and store its index in original df-CodePudding

I have a pandas data frame that looks like this.

df:

   Column1  Column2  Column3
0     DS      4.5     Hard
1     ML      2.5     Medium
2     CS       4      Hard
3     SSC     2.5     Medium
4     SST     4.1     Hard

I want to check if Column2 and Column3(can be more than two columns also) combined is having a duplicate value if yes then need to store the unique value of those columns(in the same pair) in another df and store its index in the original position.

Like in this case we will have output as two df like below:

df:

   Column1  index_Column
0     DS        0   
1     ML        1   
2     CS        2     
3     SSC       1
4     SST       3

df1:

    Column2  Column3
0     4.5     Hard
1     2.5     Medium
2      4      Hard
3     4.1     Hard

CodePudding user response：

Here is one way using pandas.factorize:

# columns to use to get duplicates
cols = ['Column2', 'Column3']

s, v = df[cols].apply(tuple, axis=1).factorize()

# indices
df = df.drop(cols, axis=1).assign(index_col=s)
# references
df1 = pd.DataFrame.from_records(v, columns=cols)

output:

# df
  Column1  index_col
0      DS          0
1      ML          1
2      CS          2
3     SSC          1
4     SST          3

# df1
   Column2 Column3
0      4.5    Hard
1      2.5  Medium
2      4.0    Hard
3      4.1    Hard

CodePudding user response：

Use GroupBy.ngroup for helper column:

df1 = (df[['Column1']].assign(index_Column = df.groupby(['Column2','Column3'], sort=False)
                                               .ngroup()))
print (df1)
  Column1  index_Column
0      DS             0
1      ML             1
2      CS             2
3     SSC             1
4     SST             3

And DataFrame.drop_duplicates:

df2 = df[['Column2','Column3']].drop_duplicates()
print (df2)
   Column2 Column3
0      4.5    Hard
1      2.5  Medium
2      4.0    Hard
4      4.1    Hard

CodePudding user response：

You can simply 1) drop duplicates, specifying those two columns as the subset, and 2) select those two columns.

df1 = df.drop_duplicates(subset=['Column2', 'Column3'])[['Column2', 'Column3']]