I want to create a system in Python where each observation in a variable is mapped to a number. The numbers from the (in this case) 5 different variables together form a unique code. The first number corresponds to the first variable, the second number to the second variable, and so on. When an observation in a later row is the same as one in an earlier row, the same number applies. As illustrated in the example, if Apple appears in rows 1 and 3, both IDs get a '1' as their first number.
The output should be a new column containing the ID. If all the observations in two rows are the same, their IDs will be the same. In the picture below you see 5 variables leading to the unique ID on the right, which should be the output.
CodePudding user response:
You can use pd.factorize, which numbers the distinct values of each column in order of first appearance:
df['UniqueID'] = (df.apply(lambda x: (1 + pd.factorize(x)[0]).astype(str))
                    .agg(''.join, axis=1))
print(df)
# Output
        Fruit     Toy Letter      Car Country UniqueID
0       Apple    Bear      A  Ferrari  Brazil    11111
1  Strawberry  Blocks      B  Peugeot   Chile    22222
2       Apple  Blocks      C  Renault   China    12333
3      Orange    Bear      D     Saab   China    31443
4      Orange    Bear      D  Ferrari   India    31414
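For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the example frame from the table above and wraps the same pd.factorize idea in a helper. The helper name make_unique_id and its sep parameter are my own additions for illustration, not part of pandas. The codes are built column by column with a dict comprehension instead of df.apply, which is just a variant of the one-liner above.

```python
import pandas as pd

# Example data copied from the output table above.
df = pd.DataFrame({
    "Fruit": ["Apple", "Strawberry", "Apple", "Orange", "Orange"],
    "Toy": ["Bear", "Blocks", "Blocks", "Bear", "Bear"],
    "Letter": ["A", "B", "C", "D", "D"],
    "Car": ["Ferrari", "Peugeot", "Renault", "Saab", "Ferrari"],
    "Country": ["Brazil", "Chile", "China", "China", "India"],
})


def make_unique_id(frame, sep=""):
    """Hypothetical helper: join the 1-based factorize codes of every column."""
    # pd.factorize returns (codes, uniques); codes start at 0 and follow
    # the order of first appearance, so add 1 to start counting at 1.
    codes = pd.DataFrame(
        {name: pd.factorize(frame[name])[0] + 1 for name in frame.columns},
        index=frame.index,
    )
    # Turn every code into a string and concatenate them row by row.
    return codes.astype(str).agg(sep.join, axis=1)


df["UniqueID"] = make_unique_id(df)
print(df)
```

One caveat: once a column has 10 or more distinct values, plain concatenation becomes ambiguous (codes 1 and 12 give the same string as codes 11 and 2). In that case passing a separator, for example make_unique_id(df, sep="-"), keeps the IDs unambiguous.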