How to Create New Columns from Rows in Python?-CodePudding

I have a dataset.

date        a     b     c    d
2021-07-01  1234  1111  123  x
2021-07-02  3456  3000  456  x
2021-07-03  5643  5142  501  x
2021-07-01  500   200   300  y
2021-07-03  1256  756   500  y
2021-07-01  1000  780   220  z
2021-07-02  3200  3000  200  z
2021-07-03  2345  2300  45   z
2021-07-04  2000  1700  300  z

There are 3 different companies in the dataset. a: total_Sales, b: net_sales, c: bonus sales, d: company name

I created a dataframe of company x (df_x). I want to add the a, b, c features of other companies (y and z) as new features to the new dataset (df_x). However, I want to add records that match date column of df_x. If there is a date that does not match, assign zero.

date        a_x   b_x   c_x   a_y  b_y  c_y  a_z   b_z   c_z
2021-07-01  1234  1111  123   500  200  300  1000  780   220
2021-07-02  3456  3000  456   0    0    0    3200  3000  200
2021-07-03  5643  5142  501   1256 756  500  2000  1700  300

I tried to this: I created other companies datasets (df_y, df_z)

df_x = df_x.merge(df_y, on='Date', how='left', suffixes=['_x', '_y']).fillna(0)
df_x = df_x.merge(df_z, on='Date', how='left', suffixes=['', '_z']).fillna(0)

But it doesn't work. This code duplicated the columns. I also have nine companies in my main dataset (real work dataset).

How can I solve this problem?

CodePudding user response：

Use DataFrame.set_index with DataFrame.unstack, then sorting second level of MultiIndex and last flatten it:

df = df.set_index(['date','d']).unstack(fill_value=0).sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
             a_x   b_x  c_x   a_y  b_y  c_y   a_z   b_z  c_z
date                                                        
2021-07-01  1234  1111  123   500  200  300  1000   780  220
2021-07-02  3456  3000  456     0    0    0  3200  3000  200
2021-07-03  5643  5142  501  1256  756  500  2345  2300   45
2021-07-04     0     0    0     0    0    0  2000  1700  300

Or DataFrame.pivot solution:

df = df.pivot('date','d').fillna(0).sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
               a_x     b_x    c_x     a_y    b_y    c_y     a_z     b_z    c_z
date                                                                          
2021-07-01  1234.0  1111.0  123.0   500.0  200.0  300.0  1000.0   780.0  220.0
2021-07-02  3456.0  3000.0  456.0     0.0    0.0    0.0  3200.0  3000.0  200.0
2021-07-03  5643.0  5142.0  501.0  1256.0  756.0  500.0  2345.0  2300.0   45.0
2021-07-04     0.0     0.0    0.0     0.0    0.0    0.0  2000.0  1700.0  300.0

If get:

"ValueError: Index contains duplicate entries, cannot reshape"

there are duplicates and need specify aggregate function, eg. here mean in DataFrame.pivot_table:

print (df)
         date     a     b    c  d
0  2021-07-01  1234  1111  123  x <- duplicates 2021-07-01, x
1  2021-07-01  1234   222  100  x <- duplicates 2021-07-01, x
2  2021-07-02  3456  3000  456  x
3  2021-07-03  5643  5142  501  x
4  2021-07-01   500   200  300  y
5  2021-07-03  1256   756  500  y
6  2021-07-01  1000   780  220  z
7  2021-07-02  3200  3000  200  z
8  2021-07-03  2345  2300   45  z
9  2021-07-04  2000  1700  300  z


df = df.pivot_table(index='date',columns='d', aggfunc='mean', fill_value=0).sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
             a_x     b_x    c_x   a_y  b_y  c_y   a_z   b_z  c_z
date                                                            
2021-07-01  1234   666.5  111.5   500  200  300  1000   780  220 <- 666.5, 111.5 is mean
2021-07-02  3456  3000.0  456.0     0    0    0  3200  3000  200
2021-07-03  5643  5142.0  501.0  1256  756  500  2345  2300   45
2021-07-04     0     0.0    0.0     0    0    0  2000  1700  300

CodePudding user response：

You can use pivot or pivot_table:

out = df.pivot('date', 'd', ['a', 'b', 'c']) \
        .sort_index(level=1, axis=1).fillna(0).astype(int)
out.columns = out.columns.to_flat_index().str.join('_')

# OR (to merge duplicate entries)

out = df.pivot_table(['a', 'b', 'c'], 'date', 'd', aggfunc='mean') \
        .sort_index(level=1, axis=1).fillna(0).astype(int)
out.columns = out.columns.to_flat_index().str.join('_')

You can replace 'mean' by 'sum', 'min' (the lowest value), 'max' (the highest value), 'first' (first value found), 'last' (last value found) or whatever custom function.

Output:

>>> out.reset_index()
         date   a_x   b_x  c_x   a_y  b_y  c_y   a_z   b_z  c_z
0  2021-07-01  1234  1111  123   500  200  300  1000   780  220
1  2021-07-02  3456  3000  456     0    0    0  3200  3000  200
2  2021-07-03  5643  5142  501  1256  756  500  2345  2300   45
3  2021-07-04     0     0    0     0    0    0  2000  1700  300