I have a pandas DataFrame that looks like this:
df:
Column1 Column2 Column3
0 DS 4.5 Hard
1 ML 2.5 Medium
2 CS 4 Hard
3 SSC 2.5 Medium
4 SST 4.1 Hard
I want to check whether Column2 and Column3 combined (there can be more than two such columns) contain duplicate values. If they do, I need to store the unique value pairs of those columns in another DataFrame, and in the original DataFrame store the index of the corresponding pair in place of the values.
In this case the output would be two DataFrames like the ones below:
df:
Column1 index_Column
0 DS 0
1 ML 1
2 CS 2
3 SSC 1
4 SST 3
df1:
Column2 Column3
0 4.5 Hard
1 2.5 Medium
2 4 Hard
3 4.1 Hard
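For reference, the input frame can be reconstructed with a minimal sketch like this (the dtypes are an assumption based on the display above):
import pandas as pd

df = pd.DataFrame({
    'Column1': ['DS', 'ML', 'CS', 'SSC', 'SST'],
    'Column2': [4.5, 2.5, 4, 2.5, 4.1],
    'Column3': ['Hard', 'Medium', 'Hard', 'Medium', 'Hard'],
})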
CodePudding user response:
Here is one way, using pandas.factorize:
import pandas as pd

# columns to check for duplicate combinations
cols = ['Column2', 'Column3']
s, v = df[cols].apply(tuple, axis=1).factorize()
# replace the pair columns with the factorized labels
df = df.drop(cols, axis=1).assign(index_col=s)
# unique pairs, in order of first appearance
df1 = pd.DataFrame.from_records(v, columns=cols)
output:
# df
Column1 index_col
0 DS 0
1 ML 1
2 CS 2
3 SSC 1
4 SST 3
# df1
Column2 Column3
0 4.5 Hard
1 2.5 Medium
2 4.0 Hard
3 4.1 Hard
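As a quick sanity check (not part of the original answer, and assuming the df and df1 produced by the code above), the split can be reversed by joining df1 back on the stored labels:
# join df1's rows back onto df via the index_col labels
restored = df.join(df1, on='index_col')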
CodePudding user response:
Use GroupBy.ngroup for the helper column:
df1 = df[['Column1']].assign(
    index_Column=df.groupby(['Column2', 'Column3'], sort=False).ngroup()
)
print (df1)
Column1 index_Column
0 DS 0
1 ML 1
2 CS 2
3 SSC 1
4 SST 3
And DataFrame.drop_duplicates:
df2 = df[['Column2','Column3']].drop_duplicates()
print (df2)
Column2 Column3
0 4.5 Hard
1 2.5 Medium
2 4.0 Hard
4 4.1 Hard
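One small difference from the desired df1: drop_duplicates keeps the original row labels (0, 1, 2, 4 here). If the pair frame should be renumbered 0..n-1 so its positions line up with the ngroup labels, a reset_index can be added (a minor variation on the code above):
df2 = df[['Column2', 'Column3']].drop_duplicates().reset_index(drop=True)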
CodePudding user response:
You can simply 1) drop duplicates, specifying those two columns as the subset, and 2) select those two columns.
df1 = df.drop_duplicates(subset=['Column2', 'Column3'])[['Column2', 'Column3']]
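This only produces the frame of unique pairs. If the index-mapping frame from the question is also needed, one possible sketch (the mapping and df_out names here are illustrative, not from the answer) is to renumber the unique pairs and merge them back onto the original rows:
import pandas as pd

cols = ['Column2', 'Column3']

# unique pairs, renumbered 0..n-1
df1 = df.drop_duplicates(subset=cols)[cols].reset_index(drop=True)

# map each original row to the position of its pair in df1
mapping = df1.reset_index().rename(columns={'index': 'index_Column'})
df_out = df.merge(mapping, on=cols, how='left')[['Column1', 'index_Column']]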