Tricky Reverse Aggregate values to unique rows per category in Pandas


I have a dataset where I would like to de-aggregate the values into their own unique rows, as well as pivot and group the data by category.

Data

Date         start      end         area    BB_stat AA_stat BB_test AA_test final   
10/1/2022   11/1/2022   12/1/2022   NY      10      80      0       1       1/1/2022    
11/1/2022   12/1/2022   01/1/2023   NY      5       90      1       0       1/1/2022    
10/1/2022   11/1/2022   12/1/2022   CA      6       100     3       1       1/1/2022    
11/1/2022   12/1/2022   01/1/2023   CA      7       0       2       8       1/1/2022    

Desired

#create a new column by string transformation

Date        start       end         type    area    stat    test    final       
10/1/2022   11/1/2022   12/1/2022   BB      NY      10      0       1/1/2022        
11/1/2022   12/1/2022   01/1/2023   BB      NY      5       1       1/1/2022        
10/1/2022   11/1/2022   12/1/2022   AA      NY      80      1       1/1/2022        
11/1/2022   12/1/2022   01/1/2023   AA      NY      90      0       1/1/2022        
10/1/2022   11/1/2022   12/1/2022   BB      CA      6       3       1/1/2022        
11/1/2022   12/1/2022   01/1/2023   BB      CA      7       2       1/1/2022        
10/1/2022   11/1/2022   12/1/2022   AA      CA      100     1       1/1/2022        
11/1/2022   12/1/2022   01/1/2023   AA      CA      0       8       1/1/2022

Doing

#some help from previous SO post/member

import pandas as pd

df = df.set_index(["Date", "start", "end"])
new_df = pd.concat([pd.Series(c, index=df.index.repeat(df[c]))
                    for c in df]).reset_index(name="type")

# then sort values
new_df = new_df.sort_values(["Date", "start", "end"], ignore_index=True) 

Any suggestions are appreciated.

CodePudding user response:

In the code below, the value_vars variable lists the column names that hold the stat and test values. The melt function then unpivots the DataFrame from wide format to long format, using that list as the columns to unpivot. The result is a DataFrame where each row holds a single stat or test value.
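For illustration, here is the shape melt produces on a one-row slice rebuilt from the question's sample data (a standalone sketch, not part of the solution code):

```python
import pandas as pd

# One row rebuilt from the question's sample data.
df = pd.DataFrame({
    "Date": ["10/1/2022"], "start": ["11/1/2022"], "end": ["12/1/2022"],
    "area": ["NY"], "BB_stat": [10], "AA_stat": [80],
    "BB_test": [0], "AA_test": [1], "final": ["1/1/2022"],
})

value_vars = ["BB_stat", "AA_stat", "BB_test", "AA_test"]
long = df.melt(id_vars=["Date", "start", "end", "area", "final"],
               value_vars=value_vars)

# The single wide row becomes four long rows, one per value column,
# stacked in value_vars order.
print(long[["variable", "value"]])
```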

Next, the str.split method splits the values in the "variable" column at the underscore, producing two new columns: "type" (BB or AA) and "name" (stat or test). The "variable" column is then dropped.
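The split itself looks like this on the melted column names (a standalone sketch; note that in pandas 2.x the n argument must be passed by keyword):

```python
import pandas as pd

# Column names as they appear in the melted "variable" column.
variable = pd.Series(["BB_stat", "AA_stat", "BB_test", "AA_test"])

# expand=True returns a DataFrame: column 0 is the prefix, column 1 the rest.
parts = variable.str.split("_", n=1, expand=True)
print(parts)
```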

The long DataFrame is then split into two half-sized DataFrames; because melt stacks the value_vars in order, the first half holds the stat rows and the second half the test rows. The two halves are merged on the common key columns, with suffixes distinguishing the value columns from each half. The "value_1" and "value_2" columns are renamed to "stat" and "test", the now-redundant "name_1" and "name_2" columns are dropped, and the columns are reordered into the desired layout.

This results in a DataFrame that has the stat and test values in their own unique rows, grouped by category.

import pandas as pd

# Unpivot the four value columns into long format.
value_vars = ["BB_stat", "AA_stat", "BB_test", "AA_test"]
df = df.melt(id_vars=["Date", "start", "end", "area", "final"], value_vars=value_vars)

# Split "BB_stat" etc. into the prefix ("type") and the measure ("name").
# Note: n must be passed by keyword in pandas 2.x.
temp_df = df["variable"].str.split("_", n=1, expand=True)
df["type"] = temp_df[0]
df["name"] = temp_df[1]
df = df.drop(columns=["variable"])

# The first half holds the stat rows and the second the test rows
# (melt stacks value_vars in order), so merging the halves pairs them up.
first_half = df.iloc[:len(df)//2]
second_half = df.iloc[len(df)//2:]
df = pd.merge(first_half, second_half,
              on=["Date", "start", "end", "area", "final", "type"],
              suffixes=("_1", "_2"))

df.rename(columns={"value_1": "stat", "value_2": "test"}, inplace=True)
df.drop(columns=["name_1", "name_2"], inplace=True)
df = df[["Date", "start", "end", "type", "area", "stat", "test", "final"]]

df.sort_values(["area", "type"], ascending=False, inplace=True)
print(df.to_markdown())

Outputs:

|    | Date      | start     | end       | type   | area   |   stat |   test | final    |
|---:|:----------|:----------|:----------|:-------|:-------|-------:|-------:|:---------|
|  0 | 10/1/2022 | 11/1/2022 | 12/1/2022 | BB     | NY     |     10 |      0 | 1/1/2022 |
|  1 | 11/1/2022 | 12/1/2022 | 01/1/2023 | BB     | NY     |      5 |      1 | 1/1/2022 |
|  4 | 10/1/2022 | 11/1/2022 | 12/1/2022 | AA     | NY     |     80 |      1 | 1/1/2022 |
|  5 | 11/1/2022 | 12/1/2022 | 01/1/2023 | AA     | NY     |     90 |      0 | 1/1/2022 |
|  2 | 10/1/2022 | 11/1/2022 | 12/1/2022 | BB     | CA     |      6 |      3 | 1/1/2022 |
|  3 | 11/1/2022 | 12/1/2022 | 01/1/2023 | BB     | CA     |      7 |      2 | 1/1/2022 |
|  6 | 10/1/2022 | 11/1/2022 | 12/1/2022 | AA     | CA     |    100 |      1 | 1/1/2022 |
|  7 | 11/1/2022 | 12/1/2022 | 01/1/2023 | AA     | CA     |      0 |      8 | 1/1/2022 |
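As an aside, the split-and-merge step can be avoided by pivoting the long frame directly on a multi-column index (supported since pandas 1.1). A self-contained sketch using the question's sample data:

```python
import io
import pandas as pd

# Sample data reproduced from the question.
csv = """Date,start,end,area,BB_stat,AA_stat,BB_test,AA_test,final
10/1/2022,11/1/2022,12/1/2022,NY,10,80,0,1,1/1/2022
11/1/2022,12/1/2022,01/1/2023,NY,5,90,1,0,1/1/2022
10/1/2022,11/1/2022,12/1/2022,CA,6,100,3,1,1/1/2022
11/1/2022,12/1/2022,01/1/2023,CA,7,0,2,8,1/1/2022"""
df = pd.read_csv(io.StringIO(csv))

# Wide -> long, then split "BB_stat" into type ("BB") and measure ("stat").
long = df.melt(id_vars=["Date", "start", "end", "area", "final"])
long[["type", "measure"]] = long["variable"].str.split("_", n=1, expand=True)

# Pivot the stat/test measures back out as columns, one row per (keys, type).
out = (long.pivot(index=["Date", "start", "end", "area", "final", "type"],
                  columns="measure", values="value")
           .reset_index()
           .rename_axis(columns=None))
out = out[["Date", "start", "end", "type", "area", "stat", "test", "final"]]
print(out)
```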