Home > front end >  Pandas convert long to wide based on multi labels(composite keys)
Pandas convert long to wide based on multi labels(composite keys)

Time:10-05

I have a dataframe that looks something like this

df_input = pd.DataFrame({'datatype':['IN','IN','EX','EX','IN','IN','EX','EX'], 'Time': [0,0,0,0,1,1,1,1], 'filetype': ['img','txt','img','txt','img','txt','img','txt']}, index=['PMSN01001','PMSN01001','PMSN01001','PMSN01001','PMSN01001','PMSN01001','PMSN01001','PMSN01001'])

print(df_input)

id          datatype    Time    filetype
PMSN01001   IN        0       img
PMSN01001   IN        0       txt
PMSN01001   EX        0       img
PMSN01001   EX        0       txt
PMSN01001   IN        1       img
PMSN01001   IN        1       txt
PMSN01001   EX        1       img
PMSN01001   EX        1       txt

I want to reshape the data frame from long to wide format like the below. I have seen many solutions in the community but few deal with column name setup intuitive and simple enough for beginners like me to understand. I know I have to use .Pivot somehow but I am struggling with two things: easy way to set up column name and how to set up multiindex(composite primary key)

worth nothing, for every id, with the combination of ['datatype', 'Time', 'filetype'], can make up to 8 data records(2x2x2). So pivoted columns should also reflect accordingly..i think..

df_output = pd.DataFrame({'datatype_time0':['IN'], 'Time_time0': [0], 'filetype_time0': ['img'],'filetype2_time0':['txt'], 'datatype_time1':['EX'], 'Time_time1':[1],  'filetype_time1': ['img'],'filetype2_time1':['txt']}, index=['PMSN01001'])
df_output
           datatype_time0   Time_time0  filetype_time0  filetype2_time0      datatype_time1 Time_time1  filetype_time1  filetype2_time1
 PMSN01001    IN                 0              img          txt               EX              1                 img              txt    
                                                             

I am open to any ways on re-naming the column names as long as i can keep track of what values belongs to what

thanks

CodePudding user response:

Try using groupby, cumcount and pivot_table:

x = df.groupby(['id', 'datatype']).agg(Time=('Time', 'first'), filetype1=('filetype', 'first'), filetype2=('filetype', 'last'))
x = x.assign(idx=x.groupby(level=0).cumcount()   1).reset_index().pivot_table(index=['id'], columns='idx', 
                    values=['datatype', 'Time', 'filetype1', 'filetype2'], aggfunc='first')
x = x.sort_index(axis=1, level=1)
x = x.set_axis([f'{x}_{y}' for x,y in x.columns], axis=1).reset_index()

>>> x
          id  Time_1 datatype_1 filetype1_1 filetype2_1  Time_2 datatype_2 filetype1_2 filetype2_2
0  PMSN01001       1         EX         img         txt       0         IN         img         txt
>>> 

CodePudding user response:

General solution for counter per filetype with GroupBy.cumcount, pivoting by DataFrame.pivot and then reshaping by DataFrame.set_index with DataFrame.unstack:

df_input = df_input.rename_axis('id').reset_index()

df_input['g'] = df_input.groupby(['id','Time','datatype']).cumcount().add(1)

df = df_input.pivot(['id','datatype','Time'], columns='g', values='filetype').add_prefix('filetype')
df = df.set_index(df.groupby('id').cumcount().add(1), append=True).reset_index(level=['datatype','Time']).unstack(-1).sort_index(level=1, axis=1)

df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
df = df.reset_index()
print (df)
          id  Time_1 datatype_1 filetype1_1 filetype2_1  Time_2 datatype_2  \
0  PMSN01001       1         EX         img         txt       0         IN   

  filetype1_2 filetype2_2  
0         img         txt  
  • Related