I have a dataframe that looks something like this
df_input = pd.DataFrame({'datatype':['IN','IN','EX','EX','IN','IN','EX','EX'], 'Time': [0,0,0,0,1,1,1,1], 'filetype': ['img','txt','img','txt','img','txt','img','txt']}, index=['PMSN01001','PMSN01001','PMSN01001','PMSN01001','PMSN01001','PMSN01001','PMSN01001','PMSN01001'])
print(df_input)
id datatype Time filetype
PMSN01001 IN 0 img
PMSN01001 IN 0 txt
PMSN01001 EX 0 img
PMSN01001 EX 0 txt
PMSN01001 IN 1 img
PMSN01001 IN 1 txt
PMSN01001 EX 1 img
PMSN01001 EX 1 txt
I want to reshape the data frame from long to wide format like the below. I have seen many solutions in the community but few deal with column name setup intuitive and simple enough for beginners like me to understand. I know I have to use .Pivot somehow but I am struggling with two things: easy way to set up column name and how to set up multiindex(composite primary key)
worth nothing, for every id, with the combination of ['datatype', 'Time', 'filetype'], can make up to 8 data records(2x2x2). So pivoted columns should also reflect accordingly..i think..
df_output = pd.DataFrame({'datatype_time0':['IN'], 'Time_time0': [0], 'filetype_time0': ['img'],'filetype2_time0':['txt'], 'datatype_time1':['EX'], 'Time_time1':[1], 'filetype_time1': ['img'],'filetype2_time1':['txt']}, index=['PMSN01001'])
df_output
datatype_time0 Time_time0 filetype_time0 filetype2_time0 datatype_time1 Time_time1 filetype_time1 filetype2_time1
PMSN01001 IN 0 img txt EX 1 img txt
I am open to any ways on re-naming the column names as long as i can keep track of what values belongs to what
thanks
CodePudding user response:
Try using groupby
, cumcount
and pivot_table
:
x = df.groupby(['id', 'datatype']).agg(Time=('Time', 'first'), filetype1=('filetype', 'first'), filetype2=('filetype', 'last'))
x = x.assign(idx=x.groupby(level=0).cumcount() 1).reset_index().pivot_table(index=['id'], columns='idx',
values=['datatype', 'Time', 'filetype1', 'filetype2'], aggfunc='first')
x = x.sort_index(axis=1, level=1)
x = x.set_axis([f'{x}_{y}' for x,y in x.columns], axis=1).reset_index()
>>> x
id Time_1 datatype_1 filetype1_1 filetype2_1 Time_2 datatype_2 filetype1_2 filetype2_2
0 PMSN01001 1 EX img txt 0 IN img txt
>>>
CodePudding user response:
General solution for counter per filetype
with GroupBy.cumcount
, pivoting by DataFrame.pivot
and then reshaping by DataFrame.set_index
with DataFrame.unstack
:
df_input = df_input.rename_axis('id').reset_index()
df_input['g'] = df_input.groupby(['id','Time','datatype']).cumcount().add(1)
df = df_input.pivot(['id','datatype','Time'], columns='g', values='filetype').add_prefix('filetype')
df = df.set_index(df.groupby('id').cumcount().add(1), append=True).reset_index(level=['datatype','Time']).unstack(-1).sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
df = df.reset_index()
print (df)
id Time_1 datatype_1 filetype1_1 filetype2_1 Time_2 datatype_2 \
0 PMSN01001 1 EX img txt 0 IN
filetype1_2 filetype2_2
0 img txt