I have a pandas dataframe like this
and i want to convert it to below dataframe
i am not sure how to use pd.wide_to_long function here
below is the dataset for creating dataframe:
Date, IN:Male teacher ,IN:Male engineer, IN: Male Atronaut , IN:female teacher ,IN:female engineer, IN: female Atronaut ,GB:Male teacher ,GB:Male engineer, GB: Male Atronaut,GB:female teacher ,GB:female engineer, GB: female Atronaut
20220405,25,29,5,41,23,23,12,23,34,11,22,34
20220404,21,29,4,40,23,22,12,23,32,10,23,34
CodePudding user response:
Convert Date
column to index
and for all another columns remove possible traling spaces by str.strip
, then replace spaces to :
and last split by one or more :
to MultiIndex
, so possible reshape by DataFrame.stack
with DataFrame.rename_axis
for new columns names created by DataFrame.reset_index
:
df1 = df.set_index('Date')
df1.columns = df1.columns.str.strip().str.replace('\s ', ':').str.split('[:] ', expand=True)
df1 = df1.stack([0,1]).rename_axis(['Date','Symbol','Gender']).reset_index()
print (df1)
Date Symbol Gender Atronaut engineer teacher
0 20220405 GB Male 34 23 12
1 20220405 GB female 34 22 11
2 20220405 IN Male 5 29 25
3 20220405 IN female 23 23 41
4 20220404 GB Male 32 23 12
5 20220404 GB female 34 23 10
6 20220404 IN Male 4 29 21
7 20220404 IN female 22 23 40
CodePudding user response:
pivot_longer from pyjanitor offers an easy way to abstract the reshaping; in this case it can be solved with a regular expression:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(
index = 'Date',
names_to = ('symbol', 'gender', '.value'),
names_pattern = r"(. ):\s*(. )\s (. )",
sort_by_appearance = True)
Date symbol gender teacher engineer Atronaut
0 20220405 IN Male 25 29 5
1 20220405 IN female 41 23 23
2 20220405 GB Male 12 23 34
3 20220405 GB female 11 22 34
4 20220404 IN Male 21 29 4
5 20220404 IN female 40 23 22
6 20220404 GB Male 12 23 32
7 20220404 GB female 10 23 34
The regular expression has capture groups, any group paired with .value
stays as a header, the rest become column values.