How to achieve this encoding in a pandas DataFrame

I have a dataframe df:

Number	Master
1	Apple
2	Orange
3	Pineapple
4	Strawberrry
5	Blueberry
6	Plums
7	Cherry
8	Dragonfruit
9	Iceapple
10	Litchie

This is just a sample df; the original dataframe has 10000 rows. I want to encode Apple, Pineapple, Orange and Strawberry as 1, 2, 3 and 4 (these happen to be the fruits with the top 4 value counts in the df) and all remaining fruits with a single code, 5. How can I achieve this? Expected output:

Number	Master
1	1
2	3
3	2
4	4
5	5
6	5
7	5
8	5
9	5
10	5

CodePudding user response：

Create a dictionary of the top N values by count in column Master using Series.value_counts with Series.head, pass it to Series.map, and replace the unmatched values with N + 1 using Series.fillna:

N = 4
d = {v: k + 1 for k, v in enumerate(df['Master'].value_counts().head(N).index)}
print(d)

df['Master'] = df['Master'].map(d).fillna(N + 1).astype(int)
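
The top-N approach above can be tried end to end with a small sketch. The sample data here is hypothetical (the counts are invented so that the four most frequent fruits have no ties), and the result is written to a new column named Code, which is an assumption made only so the original labels stay visible:

```python
import pandas as pd

# Hypothetical data: counts chosen so the top 4 fruits are unambiguous.
fruits = (["Apple"] * 5 + ["Pineapple"] * 4 + ["Orange"] * 3
          + ["Strawberrry"] * 2 + ["Blueberry", "Plums", "Cherry"])
df = pd.DataFrame({"Number": range(1, len(fruits) + 1), "Master": fruits})

N = 4
# value_counts sorts by frequency (descending), so head(N) keeps the top N labels.
top = df["Master"].value_counts().head(N).index
codes = {fruit: rank for rank, fruit in enumerate(top, start=1)}

# Labels outside the top N are missing from the dict, map() yields NaN for them,
# and fillna assigns them the shared code N + 1.
df["Code"] = df["Master"].map(codes).fillna(N + 1).astype(int)
print(df.drop_duplicates("Master"))
```

If several fruits share the same count around the cut-off, which of them lands in the top N depends on how value_counts orders the ties, so an explicit list may be the safer choice in that case.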

If you already have the top values as a list:

L = ['Apple','Pineapple','Orange','Strawberrry']
d = {v: k + 1 for k, v in enumerate(L)}
print(d)
{'Apple': 1, 'Pineapple': 2, 'Orange': 3, 'Strawberrry': 4}

df['Master'] = df['Master'].map(d).fillna(len(L) + 1).astype(int)
print(df)
   Number  Master
0       1       1
1       2       3
2       3       2
3       4       4
4       5       5
5       6       5
6       7       5
7       8       5
8       9       5
9      10       5

CodePudding user response：

You can use a dict with Series.map, then fillna(5) for values that don't exist as keys in dct.

dct = {'Apple':1, 'Pineapple':2,'Orange':3 , 'Strawberrry':4}
df['Master'] = df['Master'].map(dct).fillna(5).astype(int)
print(df)

   Number  Master
0       1       1
1       2       3
2       3       2
3       4       4
4       5       5
5       6       5
6       7       5
7       8       5
8       9       5
9      10       5
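
To see why both fillna and astype(int) appear in the one-liner above, here is a minimal sketch of the intermediate steps on a hypothetical three-element Series (the names s, mapped, filled and encoded are illustrative only):

```python
import pandas as pd

s = pd.Series(["Apple", "Plums", "Orange"])
dct = {"Apple": 1, "Pineapple": 2, "Orange": 3, "Strawberrry": 4}

mapped = s.map(dct)           # "Plums" is not a key, so its result is NaN
filled = mapped.fillna(5)     # NaN replaced by the catch-all code, still a float column
encoded = filled.astype(int)  # cast back to integers once no NaN remains

print(mapped.tolist())
print(encoded.tolist())
```

Calling astype(int) before fillna would raise an error, because NaN cannot be converted to an integer.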

CodePudding user response：

One way would also be to create a mapping function and apply it to your column:

import pandas as pd

dic = {'col1': [1, 2, 3, 4, 5, 6], 'fruits': ['apple', 'banana', 'tomato', 'something else', 'apple', 'banana']}

df = pd.DataFrame.from_dict(dic)

def mapping(value):
    # value is a single cell of the 'fruits' column
    if value == "apple":
        result = 1
    elif value == "banana":
        result = 2
    else:
        # every other fruit shares the same code
        result = 5
    return result

df['fruits'] = df['fruits'].apply(mapping)

CodePudding user response：

You can use the converters option of pandas.read_csv to encode the column while reading the CSV file.

import pandas as pd

# codes for the chosen fruits; any other value becomes '5'
converter_dict = {'Apple': '1', 'Orange': '2', 'Pineapple': '3', 'Strawberrry': '4'}

def converter_func(x):
    # read_csv passes each cell of the column to this function as a string
    return converter_dict.get(x, '5')

df = pd.read_csv('file.csv', converters={"Master": converter_func})

Output:

Number  Master
1        1
2        2
3        3
4        4
5        5
6        5
7        5
8        5
9        5
10       5
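
The converters approach can also be tried without a file on disk. The sketch below is an assumption-laden variant: it reads hypothetical CSV text from an in-memory buffer via io.StringIO, and it returns integer codes instead of strings; the name encode is illustrative.

```python
import io
import pandas as pd

converter_dict = {"Apple": 1, "Orange": 2, "Pineapple": 3, "Strawberrry": 4}

def encode(value):
    # Each cell of the Master column arrives here as a string.
    return converter_dict.get(value, 5)

# Hypothetical CSV content standing in for 'file.csv'.
csv_text = (
    "Number,Master\n"
    "1,Apple\n2,Orange\n3,Pineapple\n4,Strawberrry\n5,Blueberry\n6,Plums\n"
)

df = pd.read_csv(io.StringIO(csv_text), converters={"Master": encode})
print(df)
```

This encodes the column while parsing, which avoids a separate map step, but it only helps when the codes are known before the file is read.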

For more details on the converters parameter, you can read about it on Medium or in the official documentation.