I am trying to figure out an efficient way to map data from one df to another. The first df looks like this:
entity status sum
0 Company 1 C 70
1 Company 1 E 19
2 Company 2 C 57
3 Company 2 E 71
4 Company 3 E 16
... ... ... ...
637 Company N n 16
638 Company Z E 88
639 rows × 3 columns
I want to assign them to a second data frame so that the status is represented as columns ["E", "C", "n"] (which are the values that the status
column takes) with the sum as the value and at the same time the name of the company is used as the key.
The second df:
date entity sum
0 0202004 Company 1 90
1 0202004 Company 2 80
2 0202004 Company 3 30
3 0202004 Company 4 40
4 0202004 Company 5 50
... ... ... ...
28 0202004 Company 29 60
29 0202004 Company 30 70
and end df is:
date entity sum C E n
0 0202004 Company1 90 70 19 0
1 0202004 Company 2 80 71 57 0
2 0202004 Company 3 30 xy xy 0
3 0202004 Company 4 40 xy xy xy
4 0202004 Company 5 50 xy xy xy
... ... ... ...
28 0202004 Company 29 60 xy xy xy
29 0202004 Company 30 70 xy xy xy
So far I've written some low quality code that obviously doesn't work. I would be grateful for your tips and help!
CodePudding user response:
You want to do 2 things at once, let's disentangle this
Step 1: First, combine the 2 dataframes with pd.merge
.
merged = pd.merge(df2, df1, how='left', on=['entity'])
Note that your sum
column shows up in both df's - which makes things a bit ugly. After the merge, the one from df2 will have the suffix '_x', the one from df1 will have the suffix '_y' (you will want to rename those, as they clearly do not refer to the same 'sum' concept).
I am assuming you want a 'left' merge, i.e., keep all rows that are in your second df and enrich the status information from your first df (given your output example ends on 'Company 30' and maintains the 'sum' info from the 2nd df). Check 'inner' and 'outer' as other options.
Step 2: Then, pivot the data on the column status
, so that the multiple (unstacked) rows with the status information convert into status columns, and the values become the information from the sum of the former df1 (suffix '_y')
pd.pivot(merged, columns=['status'], values=['sum_y'])
Check out this documentation for more info
CodePudding user response:
My approach would be to first convert dataframe 1 into the correct order and then merge it with the second dataframe. The advantage is that the sum
column in the second dataframe is still present in the correct form as requested in the question. First of all the input data I used in order to check if the final output is correct:
from io import StringIO
import pandas as pd
# Create your data frame 1
input_string = """entity status sum
0 Company 1 C 70
1 Company 1 E 19
2 Company 2 C 57
3 Company 2 E 71
4 Company 3 E 16
5 Company 3 n 16
6 Company 4 E 3
"""
data = StringIO(input_string)
df1 = pd.read_csv(data, sep=r'\s\s ', engine='python')
# Create your data frame 2
input_string = """date entity sum
0 0202004 Company 1 90
1 0202004 Company 2 80
2 0202004 Company 3 30
3 0202004 Company 4 40
"""
data = StringIO(input_string)
df2 = pd.read_csv(data, sep=r'\s\s ', engine='python')
After that I change the appearance of the first dataframe:
df1_nice_form = df1.set_index(['entity', 'status'])['sum'].unstack().fillna(0).astype(int)
print(df1_nice_form)
Output:
status C E n
entity
Company 1 70 19 0
Company 2 57 71 0
Company 3 0 16 16
Company 4 0 3 0
Then, I merge the two dataframes with each other:
df2 = df2.merge(df1_nice_form, how='left', on='entity')
print(df2)
Final output:
date entity sum C E n
0 202004 Company 1 90 70 19 0
1 202004 Company 2 80 57 71 0
2 202004 Company 3 30 0 16 16
3 202004 Company 4 40 0 3 0
EDIT: You can also use the pivot function to transform the first dataframe. This would be the code:
df1_nice_form = df1.pivot('entity','status','sum').fillna(0).astype(int)