Home > Blockchain >  Python, Pandas. How to assign grouping number based on number of observations?
Python, Pandas. How to assign grouping number based on number of observations?

Time:08-26

I want to assign an unique ID per product and it's subcategories. Input data look like this:

d = {'Manufacturer': ['Samsung','Samsung','Siemens','Siemens','Siemens','Apple','Apple'],
     'Product': ['Phone','Phone','Computer','Sensor','Sensor','Phone','MacBook']}
df = pd.DataFrame(data=d)


Manufacturer    Product
0   Samsung Phone
1   Samsung Phone
2   Siemens Computer
3   Siemens Sensor
4   Siemens Sensor
5   Apple   Phone
6   Apple   MacBook

I want that UNIQUE_ID will be informing about Manufacturer and Product so I came up with such solution:

df['ID_Manufacturer'] = df.groupby(['Manufacturer']).ngroup()
df['ID_Product'] = df.groupby(['Product']).ngroup()

columns = ['ID_Manufacturer', 'ID_Product']
df[columns] = df[columns].astype(str)
df['UNIQUE_ID'] = df[columns].apply(lambda x: '.'.join(x[x.notnull()]), axis = 1)
df.drop(['ID_Manufacturer', 'ID_Product'], axis = 1)

Resulting in:

    Manufacturer    Product   UNIQUE_ID
0   Samsung         Phone     1.2
1   Samsung         Phone     1.2
2   Siemens         Computer  2.0
3   Siemens         Sensor    2.3
4   Siemens         Sensor    2.3
5   Apple           Phone     0.2
6   Apple           MacBook   0.1

However, what I want to achieve more is:

  • ID numbers starting from 1 not 0
  • Group with most observations have assigned lower number (currently is alphabetically)

So final output should look like this:

    Manufacturer    Product   UNIQUE_ID
0   Samsung         Phone     3.1
1   Samsung         Phone     3.1
2   Siemens         Computer  1.2
3   Siemens         Sensor    1.3
4   Siemens         Sensor    1.3
5   Apple           Phone     2.4
6   Apple           MacBook   2.3

CodePudding user response:

You can use pandas.factorize on the output of value_counts (sorted by default in descending order of frequencies):

id1, val1 = pd.factorize(df['Manufacturer'].value_counts().index)
id2, val2 = pd.factorize(df['Product'].value_counts().index)

df['UNIQUE_ID'] = (
    df['Manufacturer'].map(pd.Series(id1 1, index=val1).astype(str))
     '.' 
    df['Product'].map(pd.Series(id2 1, index=val2).astype(str))
)

output:

  Manufacturer   Product UNIQUE_ID
0      Samsung     Phone       2.1
1      Samsung     Phone       2.1
2      Siemens  Computer       1.3
3      Siemens    Sensor       1.2
4      Siemens    Sensor       1.2
5        Apple     Phone       3.1
6        Apple   MacBook       3.4
  • Related