melt() function duplicating dataset

I have a table like this:

id	name	doggo	floofer	puppo	pupper
1	rowa	NaN	NaN	NaN	NaN
2	ray	NaN	NaN	NaN	NaN
3	emma	NaN	NaN	NaN	pupper
4	sophy	doggo	NaN	NaN	NaN
5	jack	NaN	NaN	NaN	NaN
6	jimmy	NaN	NaN	puppo	NaN
7	bingo	NaN	NaN	NaN	NaN
8	billy	NaN	NaN	NaN	pupper
9	tiger	NaN	floofer	NaN	NaN
10	lucy	NaN	NaN	NaN	NaN

I want to combine the doggo, floofer, puppo, and pupper columns into a single category column (dog_types).

Note: Rows with no category should stay NaN in the new column, since not all the dogs were categorized.

But after using:

df1 = df.melt(id_vars = ['id', 'name'], value_vars = ['doggo', 'floofer', 'pupper', 'puppo'], var_name = 'dog_types', ignore_index = True)

The melted df now has 40 rows (every dog repeated once per column) instead of 10:

    id   name dog_types    value
0    1   rowa     doggo      NaN
1    2    ray     doggo      NaN
2    3   emma     doggo      NaN
3    4  sophy     doggo    doggo
4    5   jack     doggo      NaN
5    6  jimmy     doggo      NaN
6    7  bingo     doggo      NaN
7    8  billy     doggo      NaN
8    9  tiger     doggo      NaN
9   10   lucy     doggo      NaN
10   1   rowa   floofer      NaN
11   2    ray   floofer      NaN
12   3   emma   floofer      NaN
13   4  sophy   floofer      NaN
14   5   jack   floofer      NaN
15   6  jimmy   floofer      NaN
16   7  bingo   floofer      NaN
17   8  billy   floofer      NaN
18   9  tiger   floofer  floofer
19  10   lucy   floofer      NaN
20   1   rowa    pupper      NaN
21   2    ray    pupper      NaN
22   3   emma    pupper   pupper
23   4  sophy    pupper      NaN
24   5   jack    pupper      NaN
25   6  jimmy    pupper      NaN
26   7  bingo    pupper      NaN
27   8  billy    pupper   pupper
28   9  tiger    pupper      NaN
29  10   lucy    pupper      NaN
30   1   rowa     puppo      NaN
31   2    ray     puppo      NaN
32   3   emma     puppo      NaN
33   4  sophy     puppo      NaN
34   5   jack     puppo      NaN
35   6  jimmy     puppo    puppo
36   7  bingo     puppo      NaN
37   8  billy     puppo      NaN
38   9  tiger     puppo      NaN
39  10   lucy     puppo      NaN

How do I get the correct result without duplicates?

CodePudding user response：

df['dog_types'] = (df['doggo'].fillna(df['floofer'])
                              .fillna(df['puppo'])
                              .fillna(df['pupper']))

   id   name  doggo  floofer  puppo  pupper dog_types
0   1   rowa    NaN      NaN    NaN     NaN       NaN
1   2    ray    NaN      NaN    NaN     NaN       NaN
2   3   emma    NaN      NaN    NaN  pupper    pupper
3   4  sophy  doggo      NaN    NaN     NaN     doggo
4   5   jack    NaN      NaN    NaN     NaN       NaN
5   6  jimmy    NaN      NaN  puppo     NaN     puppo
6   7  bingo    NaN      NaN    NaN     NaN       NaN
7   8  billy    NaN      NaN    NaN  pupper    pupper
8   9  tiger    NaN  floofer    NaN     NaN   floofer
9  10   lucy    NaN      NaN    NaN     NaN       NaN

Afterwards you can drop redundant columns:

df.drop(columns=['doggo', 'floofer', 'puppo', 'pupper'], inplace=True)

   id   name dog_types
0   1   rowa       NaN
1   2    ray       NaN
2   3   emma    pupper
3   4  sophy     doggo
4   5   jack       NaN
5   6  jimmy     puppo
6   7  bingo       NaN
7   8  billy    pupper
8   9  tiger   floofer
9  10   lucy       NaN
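
For reference, melt() itself can also get there. It emits one row per dog per value column (10 x 4 = 40 here), so the NaN cells have to be dropped and the result joined back onto the id/name columns. A minimal sketch, assuming the sample frame from the question; the frame construction and the names long, tagged and result are made up for illustration:

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (construction is illustrative).
names = ["rowa", "ray", "emma", "sophy", "jack",
         "jimmy", "bingo", "billy", "tiger", "lucy"]
labels = {2: "pupper", 3: "doggo", 5: "puppo", 7: "pupper", 8: "floofer"}
cols = ["doggo", "floofer", "puppo", "pupper"]
df = pd.DataFrame({"id": range(1, 11), "name": names})
for c in cols:
    df[c] = [c if labels.get(i) == c else np.nan for i in range(10)]

# melt() yields one row per (dog, column) pair.
long = df.melt(id_vars=["id", "name"], value_vars=cols, var_name="dog_types")
print(len(long))

# Keep only the cells that actually hold a label ...
tagged = long.dropna(subset=["value"])[["id", "dog_types"]]

# ... and left-join them back so uncategorized dogs stay in with NaN.
result = df[["id", "name"]].merge(tagged, on="id", how="left")
print(result)
```

If a dog carried more than one label, the left join would return one row per label for that dog, so this only gives 10 rows when each dog has at most one.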

CodePudding user response：

Given your current structure, where the four dog columns are the last columns in the frame, we can build dog_type like this:

df['dog_type'] = df.bfill(axis=1).doggo
df = df.drop(columns=['doggo', 'floofer', 'puppo', 'pupper'])

print(df)

Output:

   id   name dog_type
0   1   rowa      NaN
1   2    ray      NaN
2   3   emma   pupper
3   4  sophy    doggo
4   5   jack      NaN
5   6  jimmy    puppo
6   7  bingo      NaN
7   8  billy   pupper
8   9  tiger  floofer
9  10   lucy      NaN

CodePudding user response：

You can also try:

l = ['doggo', 'floofer', 'pupper', 'puppo']
df['new'] = df[l].bfill(axis=1).iloc[:,0]
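
A self-contained sketch of the same idea, assuming the sample frame from the question (the frame construction is illustrative, not from the original post). Selecting only the dog columns before the back-fill means that any unrelated column sitting to the right of them could not leak into the result:

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (construction is illustrative).
names = ["rowa", "ray", "emma", "sophy", "jack",
         "jimmy", "bingo", "billy", "tiger", "lucy"]
labels = {2: "pupper", 3: "doggo", 5: "puppo", 7: "pupper", 8: "floofer"}
cols = ["doggo", "floofer", "pupper", "puppo"]
df = pd.DataFrame({"id": range(1, 11), "name": names})
for c in cols:
    df[c] = [c if labels.get(i) == c else np.nan for i in range(10)]

# Back-fill across the dog columns only: each cell takes the next non-NaN
# value to its right, so the first column ends up with the row's label.
df["new"] = df[cols].bfill(axis=1).iloc[:, 0]

print(df[["id", "name", "new"]])
```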

CodePudding user response：

You can use .stack():

cols = ['doggo', 'floofer', 'puppo', 'pupper']

1. If each row has NO MORE than 1 species:

df['dog_types'] = df[cols].stack().droplevel(1)

df['dog_types']
0        NaN
1        NaN
2     pupper
3      doggo
4        NaN
5      puppo
6        NaN
7     pupper
8    floofer
9        NaN
Name: dog_types, dtype: object
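
A runnable sketch of this first case, assuming the sample frame from the question (the frame construction is illustrative). The explicit dropna() is an addition here: older pandas versions drop NaN inside stack() by default, but as far as I know the newer stack implementation keeps them, and the assignment only works when each row label appears at most once:

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (construction is illustrative).
names = ["rowa", "ray", "emma", "sophy", "jack",
         "jimmy", "bingo", "billy", "tiger", "lucy"]
labels = {2: "pupper", 3: "doggo", 5: "puppo", 7: "pupper", 8: "floofer"}
cols = ["doggo", "floofer", "puppo", "pupper"]
df = pd.DataFrame({"id": range(1, 11), "name": names})
for c in cols:
    df[c] = [c if labels.get(i) == c else np.nan for i in range(10)]

# stack() turns the four columns into one Series indexed by
# (row label, column name); dropna() keeps only the filled cells.
stacked = df[cols].stack().dropna()

# Dropping the column-name level leaves the row label as the index, so the
# assignment aligns on it and rows without a label stay NaN.
df["dog_types"] = stacked.droplevel(1)

print(df[["id", "name", "dog_types"]])
```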

2. If a row can have more than 1 species:

You can choose either the first or the last one (just set the keep parameter to either 'first' or 'last'):

Example:

df.iloc[2,2] = 'mine'


df.loc[[2], cols]  # the row at index 2 now has multiple species
  doggo floofer puppo  pupper
2  mine     NaN   NaN  pupper

Solution:

If you try using the first method in this case, you will get a ValueError: cannot reindex on an axis with duplicate labels. So instead, use this:

res = df[cols].stack().droplevel(1)
res = res[~res.index.duplicated(keep='first')]
df['dog_types'] = res

df['dog_types']
0        NaN
1        NaN
2       mine
3      doggo
4        NaN
5      puppo
6        NaN
7     pupper
8    floofer
9        NaN
Name: dog_types, dtype: object
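
To compare the two keep settings side by side, here is a small sketch assuming the sample frame from the question plus the 'mine' edit above. The frame construction and the column names first_type and last_type are made up for illustration, and the dropna() is a defensive addition in case stack() keeps NaN in your pandas version:

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (construction is illustrative).
names = ["rowa", "ray", "emma", "sophy", "jack",
         "jimmy", "bingo", "billy", "tiger", "lucy"]
labels = {2: "pupper", 3: "doggo", 5: "puppo", 7: "pupper", 8: "floofer"}
cols = ["doggo", "floofer", "puppo", "pupper"]
df = pd.DataFrame({"id": range(1, 11), "name": names})
for c in cols:
    df[c] = [c if labels.get(i) == c else np.nan for i in range(10)]

# Give the row at index 2 a second label, as in the example above.
df.loc[2, "doggo"] = "mine"

# Row label 2 now appears twice in the stacked Series.
stacked = df[cols].stack().dropna().droplevel(1)

# "first" and "last" follow the order of the columns listed in cols.
first = stacked[~stacked.index.duplicated(keep="first")]
last = stacked[~stacked.index.duplicated(keep="last")]

df["first_type"] = first
df["last_type"] = last
print(df.loc[[2, 3], ["name", "first_type", "last_type"]])
```

Rows with a single label get the same value under both settings; only the row at index 2 differs.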