Home > database >  How to convert wide dataframe to long based on similar column
How to convert wide dataframe to long based on similar column

Time:04-14

I have a pandas dataframe like this

enter image description here

and i want to convert it to below dataframe

enter image description here

i am not sure how to use pd.wide_to_long function here

below is the dataset for creating dataframe:

Date, IN:Male teacher ,IN:Male engineer, IN: Male Atronaut , IN:female teacher ,IN:female engineer, IN: female Atronaut ,GB:Male teacher ,GB:Male engineer, GB: Male Atronaut,GB:female teacher ,GB:female engineer, GB: female Atronaut
20220405,25,29,5,41,23,23,12,23,34,11,22,34
20220404,21,29,4,40,23,22,12,23,32,10,23,34

CodePudding user response:

Convert Date column to index and for all another columns remove possible traling spaces by str.strip, then replace spaces to : and last split by one or more : to MultiIndex, so possible reshape by DataFrame.stack with DataFrame.rename_axis for new columns names created by DataFrame.reset_index:

df1 = df.set_index('Date')
df1.columns = df1.columns.str.strip().str.replace('\s ', ':').str.split('[:] ', expand=True)

df1 = df1.stack([0,1]).rename_axis(['Date','Symbol','Gender']).reset_index()
print (df1)
       Date Symbol  Gender  Atronaut  engineer  teacher
0  20220405     GB    Male        34        23       12
1  20220405     GB  female        34        22       11
2  20220405     IN    Male         5        29       25
3  20220405     IN  female        23        23       41
4  20220404     GB    Male        32        23       12
5  20220404     GB  female        34        23       10
6  20220404     IN    Male         4        29       21
7  20220404     IN  female        22        23       40

CodePudding user response:

pivot_longer from pyjanitor offers an easy way to abstract the reshaping; in this case it can be solved with a regular expression:

# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(
    index = 'Date', 
    names_to = ('symbol', 'gender', '.value'), 
    names_pattern = r"(. ):\s*(. )\s (. )",
    sort_by_appearance = True)

       Date symbol  gender  teacher  engineer  Atronaut
0  20220405     IN    Male       25        29         5
1  20220405     IN  female       41        23        23
2  20220405     GB    Male       12        23        34
3  20220405     GB  female       11        22        34
4  20220404     IN    Male       21        29         4
5  20220404     IN  female       40        23        22
6  20220404     GB    Male       12        23        32
7  20220404     GB  female       10        23        34

The regular expression has capture groups, any group paired with .value stays as a header, the rest become column values.

  • Related