I have the following df from a vendor:
Unnamed: 0  Unnamed: 1     Unnamed: 2  agg metrics  10/20/22  10/20/22  10/21/22  10/21/22
title       content title  season      episode      start     hours     start     hours
book        blue           1           3            2         2         5         2
movie       orange         2           4            11        4         7         4
I need the output like this:
title  content title  season  episode  date      start  hours
book   blue           1       3        10/20/22  2      2
book   blue           1       3        10/21/22  5      2
movie  orange         2       4        10/20/22  11     4
movie  orange         2       4        10/21/22  7      4
This is what I tried so far:
import pandas as pd
df = pd.read_csv('file')
df = df.drop(labels=0, axis=0)
df1 = df.melt(['Unnamed: 0','Unnamed: 1', 'Unnamed: 2', 'agg metrics'],var_name='Date', value_name='Value')
But this doesn't return the desired output. Apologies for not knowing how to represent this properly; hopefully my input/output examples help.
Essentially, I'm having trouble transposing multiple headers.
Thanks for your help!
CodePudding user response:
You could do the following; I believe this is what QuangHoang had in mind too:
from io import StringIO

import pandas as pd

# Read the CSV with the top two rows as headers, which yields MultiIndex columns.
# Judging from your code, you are not doing that.
df = pd.read_csv(
    StringIO(
        """
Unnamed: 0,Unnamed: 1,Unnamed: 2,agg metrics,10/20/22,10/20/22,10/21/22,10/21/22
title,content title,season,episode,start,hours,start,hours
book,blue,1,3,2,2,5,2
movie,orange,2,4,11,4,7,4
"""
    ),
    header=[0, 1],
)
# Then keep the date-like columns, stack level 0 (the dates) into rows, and move the date out of the index
t = df.filter(regex=r"\d+/\d+/\d+")
t1 = t.stack(0).rename_axis(["", "date"]).reset_index(1)
# Then get other columns and reindex to the index of the intermediate output you got above.
t2 = df[df.columns.difference(t.columns)].droplevel(0, axis=1).reindex(t1.index)
# Then concat both along axis 1
out = pd.concat([t2, t1], axis=1)
print(out)
   title  content title  season  episode      date  hours  start
0   book           blue       1        3  10/20/22      2      2
0   book           blue       1        3  10/21/22      2      5
1  movie         orange       2        4  10/20/22      4     11
1  movie         orange       2        4  10/21/22      4      7
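The frame above still has `hours` before `start` and a repeated index (0, 0, 1, 1), whereas the question asks for `start` first. A minimal sketch of the clean-up follows; `out` is rebuilt by hand here only so the snippet runs on its own, and the names `wanted` and `tidy` are mine, not part of the answer.

```python
import pandas as pd

# Stand-in for `out` from the answer above, typed in by hand (an assumption
# made so this snippet is self-contained).
out = pd.DataFrame(
    {
        "title": ["book", "book", "movie", "movie"],
        "content title": ["blue", "blue", "orange", "orange"],
        "season": [1, 1, 2, 2],
        "episode": [3, 3, 4, 4],
        "date": ["10/20/22", "10/21/22", "10/20/22", "10/21/22"],
        "hours": [2, 2, 4, 4],
        "start": [2, 5, 11, 7],
    },
    index=[0, 0, 1, 1],
)

# Column order requested in the question.
wanted = ["title", "content title", "season", "episode", "date", "start", "hours"]

# Reorder the columns and replace the repeated index with 0..n-1.
tidy = out[wanted].reset_index(drop=True)
print(tidy)
```

Selecting with a list of column names only reorders the columns; `reset_index(drop=True)` discards the old index instead of adding it as a column.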
CodePudding user response:
Here's an example of what I mean:
from io import StringIO

import pandas as pd

# mock CSV file with StringIO (columns separated by two or more spaces)
s = StringIO('''
Unnamed: 0  Unnamed: 1     Unnamed: 2  agg metrics  10/20/22  10/20/22  10/21/22  10/21/22
title       content title  season      episode      start     hours     start     hours
book        blue           1           3            2         2         5         2
movie       orange         2           4            11        4         7         4
''')
# drop the `sep` and `engine` arguments if your file is comma-separated
df = pd.read_csv(s, sep=r'\s\s+', engine='python', header=[0,1], index_col=[0,1,2,3])
df.stack(level=0).reset_index()
Output (rename your columns accordingly):
title  level_0  level_1  level_2  level_3  Unnamed: 0  hours  start
0         book     blue        1        3    10/20/22      2      2
1         book     blue        1        3    10/21/22      2      5
2        movie   orange        2        4    10/20/22      4     11
3        movie   orange        2        4    10/21/22      4      7
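For the renaming step, one option is to name the index and column levels before stacking, so nothing needs renaming afterwards. This is only a sketch of that idea, not the answerer's code: it assumes the first four columns are the descriptive ones, uses a comma-separated copy of the sample data, and the names `ids`, `wide`, `long` and `metric` are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Same sample data as in the question, comma-separated for simplicity.
csv_text = """Unnamed: 0,Unnamed: 1,Unnamed: 2,agg metrics,10/20/22,10/20/22,10/21/22,10/21/22
title,content title,season,episode,start,hours,start,hours
book,blue,1,3,2,2,5,2
movie,orange,2,4,11,4,7,4
"""

df = pd.read_csv(StringIO(csv_text), header=[0, 1])

# Assumption: the first four columns describe the item; keep only their
# second header row (title, content title, season, episode) as names.
ids = df.iloc[:, :4].droplevel(0, axis=1)

# The remaining columns are (date, metric) pairs; name both levels explicitly.
wide = df.iloc[:, 4:].copy()
wide.columns = pd.MultiIndex.from_tuples(
    wide.columns.tolist(), names=["date", "metric"]
)
wide.index = pd.MultiIndex.from_frame(ids)

# Move the date level into the rows, then put the columns in the requested order.
long = (
    wide.stack("date")[["start", "hours"]]
    .reset_index()
    .rename_axis(columns=None)
)
print(long)
```

Selecting `[["start", "hours"]]` explicitly avoids depending on whether `stack` sorts the remaining column labels, which can differ between pandas versions.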