I have a dataset where I would like to de aggregate the values into their own unique rows as well as perform a pivot and grouping by category.
Data
Date start end area BB_stat AA_stat BB_test AA_test final
10/1/2022 11/1/2022 12/1/2022 NY 10 80 0 1 1/1/2022
11/1/2022 12/1/2022 01/1/2023 NY 5 90 1 0 1/1/2022
10/1/2022 11/1/2022 12/1/2022 CA 6 100 3 1 1/1/2022
11/1/2022 12/1/2022 01/1/2023 CA 7 0 2 8 1/1/2022
Desired
#create a new column by string transformation
Date start end type area stat test final
10/1/2022 11/1/2022 12/1/2022 BB NY 10 0 1/1/2022
11/1/2022 12/1/2022 01/1/2023 BB NY 5 1 1/1/2022
10/1/2022 11/1/2022 12/1/2022 AA NY 80 1 1/1/2022
11/1/2022 12/1/2022 01/1/2023 AA NY 90 0 1/1/2022
10/1/2022 11/1/2022 12/1/2022 BB CA 6 3 1/1/2022
11/1/2022 12/1/2022 01/1/2023 BB CA 7 2 1/1/2022
10/1/2022 11/1/2022 12/1/2022 AA CA 100 1 1/1/2022
11/1/2022 12/1/2022 01/1/2023 AA CA 0 8 1/1/2022
Doing
#some help from previous SO post/member
df = df.set_index(["Date", "start", "end"])
new_df = pd.concat([pd.Series(c, index=df.index.repeat(df[c]))
for c in df]).reset_index(name="type")
# then sort values
new_df = new_df.sort_values(["Date", "start", "end"], ignore_index=True)
Any suggestion is appreciated
CodePudding user response:
In the code, the value_vars
variable is defined as a list of column names that contain the stat and test values. The melt
function is then used to unpivot the DataFrame from wide format to long format, using the list of column names in the value_vars variable as the values to unpivot. This results in a new DataFrame where each row contains a single stat or test value.
Next, the str.split
method is used to split the values in the "variable" column at the underscore character, resulting in two new columns called "type" and "name". The "variable" column is then dropped.
The resulting DataFrame is split into two half-sized DataFrames, and the two halves are merged together on the common columns, using suffixes to distinguish the columns from each half. The merged DataFrame is then renamed to have "stat" and "test" columns instead of the "value_1" and "value_2" columns. The unnecessary "name_1" and "name_2" columns are also dropped, and the columns are rearranged to have the desired order.
This results in a DataFrame that has the stat and test values in their own unique rows, grouped by category.
value_vars = ["BB_stat", "AA_stat", "BB_test", "AA_test"]
df = df.melt(id_vars=["Date", "start", "end", "area", "final"], value_vars=value_vars)
temp_df = df.variable.str.split("_", 1, expand=True)
df["type"] = temp_df[0]
df["name"] = temp_df[1]
df = df.drop(columns=["variable"])
first_half = df.iloc[:len(df)//2]
second_half = df.iloc[len(df)//2:]
df = pd.merge(first_half, second_half, on=["Date", "start", "end", "area", "final", "type"], suffixes=("_1", "_2"))
df.rename(columns = {'value_2':'test', 'value_1':'stat'}, inplace = True)
df.drop(columns=["name_1", "name_2"], inplace=True)
df = df[["Date", "start", "end", "type", "area", "stat", "test","final"]]
df.sort_values(["area", "type"], ascending=False, inplace=True)
df.to_markdown()
Outputs:
| | Date | start | end | type | area | stat | test | final |
|---:|:----------|:----------|:----------|:-------|:-------|-------:|-------:|:---------|
| 0 | 10/1/2022 | 11/1/2022 | 12/1/2022 | BB | NY | 10 | 0 | 1/1/2022 |
| 1 | 11/1/2022 | 12/1/2022 | 01/1/2023 | BB | NY | 5 | 1 | 1/1/2022 |
| 4 | 10/1/2022 | 11/1/2022 | 12/1/2022 | AA | NY | 80 | 1 | 1/1/2022 |
| 5 | 11/1/2022 | 12/1/2022 | 01/1/2023 | AA | NY | 90 | 0 | 1/1/2022 |
| 2 | 10/1/2022 | 11/1/2022 | 12/1/2022 | BB | CA | 6 | 3 | 1/1/2022 |
| 3 | 11/1/2022 | 12/1/2022 | 01/1/2023 | BB | CA | 7 | 2 | 1/1/2022 |
| 6 | 10/1/2022 | 11/1/2022 | 12/1/2022 | AA | CA | 100 | 1 | 1/1/2022 |
| 7 | 11/1/2022 | 12/1/2022 | 01/1/2023 | AA | CA | 0 | 8 | 1/1/2022 |