I have a dataset (df), that looks like this:
Date | ID | County Name | State | State Name | Product Name | Type of Transaction | QTY |
---|---|---|---|---|---|---|---|
202105 | 10001 | Los Angeles | CA | California | Shoes | Entry | 630 |
202012 | 10002 | Houston | TX | Texas | Keyboard | Exit | 5493 |
202001 | 11684 | Chicago | IL | Illionis | Phone | Disposal | 220 |
202107 | 12005 | New York | NY | New York | Phone | Entry | 302 |
... | ... | ... | ... | ... | ... | ... | ... |
202111 | 14990 | Orlando | FL | Florida | Shoes | Exit | 201 |
For every county, there are multiple entries for different Products, types of transactions, and at different dates, but not all counties have the same number of entries and they don't follow the same dates.
I want to recreate this dataset, such that: 1 - All counties have the same start and end dates, and for those dates where the county does not record entries, I want this entry to be recorded as NaN. 2 - The product names and their types are their own columns.
Essentially, this is how the dataset needs to look:
Date | ID | County Name | State | State Name | Shoes, Entry | Shoes, Exit | Shoes, Disposal | Phones, Entry | Phones, Exit | Phones, Disposal | Keyboard, Entry | Keyboard, Exit | Keyboard, Disposal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
202105 | 10001 | Los Angeles | CA | California | 594 | 694 | 5660 | 33299 | 1110 | 5659 | 4559 | 3223 | 56889 |
202012 | 10002 | Houston | TX | Texas | 3420 | 4439 | 549 | 2110 | 5669 | 2245 | 39294 | 3345 | 556 |
202001 | 11684 | Chicago | IL | Illionis | 55432 | 4439 | 329 | 21190 | 4320 | 455 | 34059 | 44556 | 5677 |
202107 | 12005 | New York | NY | New York | 34556 | 2204 | 4329 | 11193 | 22345 | 43221 | 1544 | 3467 | 22450 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
202111 | 14990 | Orlando | FL | Florida | 54543 | 23059 | 3290 | 21394 | 34335 | 59660 | NaN | NaN | NaN |
Under the example, you can see how Florida does not record certain transactions. I would like to add the NaN such that the dataframe looks like this. I appreciate all the help!
CodePudding user response:
This is essentially a pivot
, with flattening of the MultiIndex:
(df
.pivot(index=['Date', 'ID', 'County Name', 'State', 'State Name'],
columns=['Product Name', 'Type of Transaction'],
values='QTY')
.pipe(lambda d: d.set_axis(map(','.join, d. columns), axis=1))
.reset_index()
)
Output:
Date ID County Name State State Name Shoes,Entry Keyboard,Exit \
0 202001 11684 Chicago IL Illionis NaN NaN
1 202012 10002 Houston TX Texas NaN 5493.0
2 202105 10001 Los Angeles CA California 630.0 NaN
3 202107 12005 New York NY New York NaN NaN
Phone,Disposal Phone,Entry
0 220.0 NaN
1 NaN NaN
2 NaN NaN
3 NaN 302.0