Home > Software engineering >  How to transform combinations of values in columns into individual columns?
How to transform combinations of values in columns into individual columns?

Time:05-05

I have a dataset (df), that looks like this:

Date ID County Name State State Name Product Name Type of Transaction QTY
202105 10001 Los Angeles CA California Shoes Entry 630
202012 10002 Houston TX Texas Keyboard Exit 5493
202001 11684 Chicago IL Illionis Phone Disposal 220
202107 12005 New York NY New York Phone Entry 302
... ... ... ... ... ... ... ...
202111 14990 Orlando FL Florida Shoes Exit 201

For every county, there are multiple entries for different Products, types of transactions, and at different dates, but not all counties have the same number of entries and they don't follow the same dates.

I want to recreate this dataset, such that: 1 - All counties have the same start and end dates, and for those dates where the county does not record entries, I want this entry to be recorded as NaN. 2 - The product names and their types are their own columns.

Essentially, this is how the dataset needs to look:

Date ID County Name State State Name Shoes, Entry Shoes, Exit Shoes, Disposal Phones, Entry Phones, Exit Phones, Disposal Keyboard, Entry Keyboard, Exit Keyboard, Disposal
202105 10001 Los Angeles CA California 594 694 5660 33299 1110 5659 4559 3223 56889
202012 10002 Houston TX Texas 3420 4439 549 2110 5669 2245 39294 3345 556
202001 11684 Chicago IL Illionis 55432 4439 329 21190 4320 455 34059 44556 5677
202107 12005 New York NY New York 34556 2204 4329 11193 22345 43221 1544 3467 22450
... ... ... ... ... ... ... ... ... ... ... ... ... ...
202111 14990 Orlando FL Florida 54543 23059 3290 21394 34335 59660 NaN NaN NaN

Under the example, you can see how Florida does not record certain transactions. I would like to add the NaN such that the dataframe looks like this. I appreciate all the help!

CodePudding user response:

This is essentially a pivot, with flattening of the MultiIndex:

(df
 .pivot(index=['Date', 'ID', 'County Name', 'State', 'State Name'],
        columns=['Product Name', 'Type of Transaction'],
        values='QTY')
 .pipe(lambda d: d.set_axis(map(','.join, d. columns), axis=1))
 .reset_index()
 )

Output:

     Date     ID  County Name State  State Name  Shoes,Entry  Keyboard,Exit  \
0  202001  11684      Chicago    IL    Illionis          NaN            NaN   
1  202012  10002      Houston    TX       Texas          NaN         5493.0   
2  202105  10001  Los Angeles    CA  California        630.0            NaN   
3  202107  12005     New York    NY    New York          NaN            NaN   

   Phone,Disposal  Phone,Entry  
0           220.0          NaN  
1             NaN          NaN  
2             NaN          NaN  
3             NaN        302.0  
  • Related