Home > Software engineering >  De-duplicate some columns while doing a "hierarchical" one-hot encoding
De-duplicate some columns while doing a "hierarchical" one-hot encoding

Time:08-12

I have a pandas dataframe (df) with columns A, B, C, and D. I have situation in which I wish to de-duplicate the values in the first two columns, one-hot-encode the third column, and do a one-hot-encoding of the last column in such a way that is references

Individual Columns

  • Column A contains identifiers of objects of measurement. These are not currently unique to a row in df, but that is desired.
  • Column B contains floats that are duplicated in each occurrence when values of A repeats. These should be remain paired to
  • Column C contains a categorical variable describing something about each objective of measure.
  • Column D is another categorical variable which describes a property of a given level of C.

Subsets of Columns

  • Every row of df is unique.
  • Not all pairs of values from columns A and B are unique, which is due to mapping to distinct values from C.
  • Every pair of values from columns A and C are unique among the rows.
  • Not all pairs of values from columns A and D are unique.
  • Not all pairs of values from columns C and D are unique.
  • Every triple of values from columns A, B and C are unique.

Example

The last point of about one-hot-encoding the D column requires explanation, which I will attempt in an example. The following data table is analogous to my problem.

Cat_ID Tail_Length Part Colour
1 23.0 nose Brown
1 23.0 belly white
2 22.5 nose pink
2 22.5 belly black
2 22.5 tail black
3 20.0 tail orange

Our first column identifies the cat. I would like there to be one row per cat. The second column gives the length of the cat's tail, which is directly tied to the cat. The tail length need not be unique to a cat, but each cat should have only one tail length. The Part is referring to some body part of the cat. I would like to one-hot the body parts, which I would normally do using pd.get_dummies. Since a cat having multiple body parts is allowed, i.e. not mutually exclusive, it makes sense to aggregate the multiple rows of one-hot encoded parts by summation to get a less sparse 1-0 row vector. The last column, Colour, refers to the colour of the given body part for the cat. I would also like Colour to be one-hot-encoded, but in a way that refers to which body part we are referring to.

Question

Given such a data frame loaded as a pandas data frame object called df, how can I achieve the de-duplication and the (sort of hierarchical) one-hot encoding?

CodePudding user response:

Is this the type of output you want?

enter image description here

import pandas as pd
import io

df = pd.read_csv(io.StringIO(
"""Cat_ID   Tail_Length Part    Colour
1   23.0    nose    Brown
1   23.0    belly   white
2   22.5    nose    pink
2   22.5    belly   black
2   22.5    tail    black
3   20.0    tail    orange"""
),delim_whitespace=True)

onehot_df = df.assign(value=1).pivot_table(
    index = ['Cat_ID','Tail_Length'],
    columns = ['Part','Colour'],
    fill_value = 0,
)

print(onehot_df)
  • Related