I have a dataset.
date a b c d
2021-07-01 1234 1111 123 x
2021-07-02 3456 3000 456 x
2021-07-03 5643 5142 501 x
2021-07-01 500 200 300 y
2021-07-03 1256 756 500 y
2021-07-01 1000 780 220 z
2021-07-02 3200 3000 200 z
2021-07-03 2345 2300 45 z
2021-07-04 2000 1700 300 z
There are 3 different companies in the dataset. a: total_sales, b: net_sales, c: bonus_sales, d: company name.
I created a dataframe for company x (df_x). I want to add the a, b, c columns of the other companies (y and z) as new features in df_x, but only for records whose dates match the date column of df_x. If a date does not match, assign zero.
date a_x b_x c_x a_y b_y c_y a_z b_z c_z
2021-07-01 1234 1111 123 500 200 300 1000 780 220
2021-07-02 3456 3000 456 0 0 0 3200 3000 200
2021-07-03 5643 5142 501 1256 756 500 2345 2300 45
I tried this: I created the other companies' datasets (df_y, df_z), then:
df_x = df_x.merge(df_y, on='Date', how='left', suffixes=['_x', '_y']).fillna(0)
df_x = df_x.merge(df_z, on='Date', how='left', suffixes=['', '_z']).fillna(0)
But it doesn't work: this code duplicates the columns. My real dataset also has nine companies, not three.
How can I solve this problem?
CodePudding user response:
Use DataFrame.set_index with DataFrame.unstack, then sort the second level of the column MultiIndex, and finally flatten it:
df = df.set_index(['date','d']).unstack(fill_value=0).sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
a_x b_x c_x a_y b_y c_y a_z b_z c_z
date
2021-07-01 1234 1111 123 500 200 300 1000 780 220
2021-07-02 3456 3000 456 0 0 0 3200 3000 200
2021-07-03 5643 5142 501 1256 756 500 2345 2300 45
2021-07-04 0 0 0 0 0 0 2000 1700 300
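Note the output above keeps 2021-07-04, a date on which company x has no record, while the desired output only keeps df_x's dates. If only the dates present for company x are wanted, one option is to filter before reshaping. A minimal sketch on the sample data (the filtering step is an addition, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2021-07-01', '2021-07-02', '2021-07-03', '2021-07-01', '2021-07-03',
             '2021-07-01', '2021-07-02', '2021-07-03', '2021-07-04'],
    'a': [1234, 3456, 5643, 500, 1256, 1000, 3200, 2345, 2000],
    'b': [1111, 3000, 5142, 200, 756, 780, 3000, 2300, 1700],
    'c': [123, 456, 501, 300, 500, 220, 200, 45, 300],
    'd': ['x', 'x', 'x', 'y', 'y', 'z', 'z', 'z', 'z'],
})

# keep only the dates on which company x has a record
x_dates = df.loc[df['d'] == 'x', 'date'].unique()
df = df[df['date'].isin(x_dates)]

# same reshape as above: unstack by company, fill gaps with 0, flatten columns
out = df.set_index(['date', 'd']).unstack(fill_value=0).sort_index(level=1, axis=1)
out.columns = out.columns.map('_'.join)
print(out)
```

This drops the 2021-07-04 row entirely, matching the three-row desired output.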
Or a DataFrame.pivot solution:
df = df.pivot(index='date', columns='d').fillna(0).sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
a_x b_x c_x a_y b_y c_y a_z b_z c_z
date
2021-07-01 1234.0 1111.0 123.0 500.0 200.0 300.0 1000.0 780.0 220.0
2021-07-02 3456.0 3000.0 456.0 0.0 0.0 0.0 3200.0 3000.0 200.0
2021-07-03 5643.0 5142.0 501.0 1256.0 756.0 500.0 2345.0 2300.0 45.0
2021-07-04 0.0 0.0 0.0 0.0 0.0 0.0 2000.0 1700.0 300.0
If you get:
"ValueError: Index contains duplicate entries, cannot reshape"
there are duplicates, and you need to specify an aggregation function (e.g. mean here) in DataFrame.pivot_table:
print (df)
date a b c d
0 2021-07-01 1234 1111 123 x <- duplicates 2021-07-01, x
1 2021-07-01 1234 222 100 x <- duplicates 2021-07-01, x
2 2021-07-02 3456 3000 456 x
3 2021-07-03 5643 5142 501 x
4 2021-07-01 500 200 300 y
5 2021-07-03 1256 756 500 y
6 2021-07-01 1000 780 220 z
7 2021-07-02 3200 3000 200 z
8 2021-07-03 2345 2300 45 z
9 2021-07-04 2000 1700 300 z
df = df.pivot_table(index='date',columns='d', aggfunc='mean', fill_value=0).sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
a_x b_x c_x a_y b_y c_y a_z b_z c_z
date
2021-07-01 1234 666.5 111.5 500 200 300 1000 780 220 <- 666.5, 111.5 is mean
2021-07-02 3456 3000.0 456.0 0 0 0 3200 3000 200
2021-07-03 5643 5142.0 501.0 1256 756 500 2345 2300 45
2021-07-04 0 0.0 0.0 0 0 0 2000 1700 300
CodePudding user response:
You can use pivot or pivot_table:
out = df.pivot(index='date', columns='d', values=['a', 'b', 'c']) \
    .sort_index(level=1, axis=1).fillna(0).astype(int)
out.columns = out.columns.to_flat_index().str.join('_')
# OR (to merge duplicate entries)
out = df.pivot_table(values=['a', 'b', 'c'], index='date', columns='d', aggfunc='mean') \
    .sort_index(level=1, axis=1).fillna(0).astype(int)
out.columns = out.columns.to_flat_index().str.join('_')
You can replace 'mean' with 'sum', 'min' (the lowest value), 'max' (the highest value), 'first' (first value found), 'last' (last value found), or any custom function.
Output:
>>> out.reset_index()
date a_x b_x c_x a_y b_y c_y a_z b_z c_z
0 2021-07-01 1234 1111 123 500 200 300 1000 780 220
1 2021-07-02 3456 3000 456 0 0 0 3200 3000 200
2 2021-07-03 5643 5142 501 1256 756 500 2345 2300 45
3 2021-07-04 0 0 0 0 0 0 2000 1700 300
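Both pivot-based approaches also handle the nine-company case with no extra work: every distinct value in d becomes its own column group automatically. A minimal end-to-end sketch on a two-company subset (the dataframe here is built inline for illustration):

```python
import pandas as pd

# subset of the sample data with two companies and non-overlapping dates
df = pd.DataFrame({
    'date': ['2021-07-01', '2021-07-02', '2021-07-01', '2021-07-03'],
    'a': [1234, 3456, 500, 1256],
    'b': [1111, 3000, 200, 756],
    'c': [123, 456, 300, 500],
    'd': ['x', 'x', 'y', 'y'],
})

# one column group per company value in d; missing (date, company) pairs become 0
out = df.pivot_table(values=['a', 'b', 'c'], index='date', columns='d',
                     aggfunc='mean').sort_index(level=1, axis=1).fillna(0).astype(int)
out.columns = out.columns.to_flat_index().str.join('_')
print(out.columns.tolist())  # ['a_x', 'b_x', 'c_x', 'a_y', 'b_y', 'c_y']
```

Adding a third company to d would simply add an a_/b_/c_ column trio, with no change to the reshaping code.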