Home > Mobile >  De-aggregate values to unique rows using Python
De-aggregate values to unique rows using Python

Time:09-20

I have a dataset where I would like to de aggregate the values into their own unique rows as well as perform a pivot.

Data

ID  type    Q1 24   Q2 24   
AA  hello   1       1   
AA  hi      2       1   
AA  bye     1       0   
AA  ok      0       1   
BB  hello   1       0   
BB  hi      1       0   
BB  bye     0       1   
BB  ok      1       1   

Desired

ID  date    type
AA  Q1 24   hello
AA  Q1 24   hi
AA  Q1 24   hi
AA  Q1 24   bye
AA  Q2 24   hello
AA  Q2 24   hi
AA  Q2 24   ok
BB  Q1 24   hello
BB  Q1 24   hi
BB  Q1 24   ok
BB  Q1 24   bye
BB  Q2 24   ok

Doing

import pandas as pd
import janitor

df.pivot_longer(
    index = 'ID', 
    names_to = ('date', '.value'), 
    names_pattern = r"(Q\d )_?(. )",
    sort_by_appearance = True)

I am currently researching, any suggestion is appreciated.

CodePudding user response:

For this use case, you do not need the .value for pivot_longer; the .value is used if you wish to retain part of the column header, in this case you are not, you are simply flipping everything into long form and filtering for rows where the value is greater than 0 :

# pip install pyjanitor
import pandas as pd
import janitor

(df
.pivot_longer(
    index = ['ID', 'type'], 
    names_to = 'date')
.loc[lambda df: df.value.gt(0), ['ID', 'type', 'date']]
)
    ID   type   date
0   AA  hello  Q1 24
1   AA     hi  Q1 24
2   AA    bye  Q1 24
4   BB  hello  Q1 24
5   BB     hi  Q1 24
7   BB     ok  Q1 24
8   AA  hello  Q2 24
9   AA     hi  Q2 24
11  AA     ok  Q2 24
14  BB    bye  Q2 24
15  BB     ok  Q2 24

Of course, this transformation is pretty simple and you don't even need pivot_longer, you can do it with pd.melt:

(df
.melt(
    ['ID', 'type'], 
    var_name='date')
.loc[lambda df: df.value.gt(0), ['ID', 'type', 'date']]
)
    ID   type   date
0   AA  hello  Q1 24
1   AA     hi  Q1 24
2   AA    bye  Q1 24
4   BB  hello  Q1 24
5   BB     hi  Q1 24
7   BB     ok  Q1 24
8   AA  hello  Q2 24
9   AA     hi  Q2 24
11  AA     ok  Q2 24
14  BB    bye  Q2 24
15  BB     ok  Q2 24

CodePudding user response:

IIUC, one way using pandas.Index.repeat with pandas.concat:

df = df.set_index(["ID", "type"])
new_df = pd.concat([pd.Series(c, index=df.index.repeat(df[c]))
                    for c in df]).reset_index(name="date")

# Optionally, sort values
new_df = new_df.sort_values(["ID", "date"], ignore_index=True) 

Output:

    ID   type   date
0   AA  hello  Q1 24
1   AA     hi  Q1 24
2   AA     hi  Q1 24
3   AA    bye  Q1 24
4   AA  hello  Q2 24
5   AA     hi  Q2 24
6   AA     ok  Q2 24
7   BB  hello  Q1 24
8   BB     hi  Q1 24
9   BB     ok  Q1 24
10  BB    bye  Q2 24
11  BB     ok  Q2 24
  • Related