Mapping data from one df to another based on 2 conditions-CodePudding

I am trying to figure out an efficient way to map data from one df to another. The first df looks like this:

    entity                  status                  sum
0   Company 1                C                       70
1   Company 1                E                       19
2   Company 2                C                       57
3   Company 2                E                       71
4   Company 3                E                       16
... ... ... ...
637 Company N                n                       16
638 Company Z                E                       88

639 rows × 3 columns

I want to assign them to a second data frame so that the status is represented as columns ["E", "C", "n"] (which are the values that the status column takes) with the sum as the value and at the same time the name of the company is used as the key. The second df:

   date          entity             sum
0   0202004      Company 1          90
1   0202004      Company 2          80
2   0202004      Company 3          30
3   0202004      Company 4          40
4   0202004      Company 5          50
... ... ... ...
28  0202004      Company 29         60
29  0202004      Company 30         70

and end df is:

   date          entity             sum        C     E      n
0   0202004      Company1           90        70     19     0
1   0202004      Company 2          80        71     57     0
2   0202004      Company 3          30        xy      xy    0
3   0202004      Company 4          40        xy      xy    xy
4   0202004      Company 5          50        xy      xy    xy
... ... ... ...
28  0202004      Company 29         60        xy      xy    xy
29  0202004      Company 30         70        xy      xy    xy

So far I've written some low quality code that obviously doesn't work. I would be grateful for your tips and help!

CodePudding user response：

You want to do 2 things at once, let's disentangle this

Step 1: First, combine the 2 dataframes with pd.merge.

merged = pd.merge(df2, df1, how='left', on=['entity'])

Note that your sum column shows up in both df's - which makes things a bit ugly. After the merge, the one from df2 will have the suffix '_x', the one from df1 will have the suffix '_y' (you will want to rename those, as they clearly do not refer to the same 'sum' concept).

I am assuming you want a 'left' merge, i.e., keep all rows that are in your second df and enrich the status information from your first df (given your output example ends on 'Company 30' and maintains the 'sum' info from the 2nd df). Check 'inner' and 'outer' as other options.

Step 2: Then, pivot the data on the column status, so that the multiple (unstacked) rows with the status information convert into status columns, and the values become the information from the sum of the former df1 (suffix '_y')

pd.pivot(merged, columns=['status'], values=['sum_y'])

Check out this documentation for more info

CodePudding user response：

My approach would be to first convert dataframe 1 into the correct order and then merge it with the second dataframe. The advantage is that the sum column in the second dataframe is still present in the correct form as requested in the question. First of all the input data I used in order to check if the final output is correct:

from io import StringIO
import pandas as pd

# Create your data frame 1
input_string = """entity                  status                  sum
0   Company 1                C                       70
1   Company 1                E                       19
2   Company 2                C                       57
3   Company 2                E                       71
4   Company 3                E                       16
5   Company 3                n                       16
6   Company 4                E                       3
"""

data = StringIO(input_string)
df1 = pd.read_csv(data, sep=r'\s\s ', engine='python')

# Create your data frame 2
input_string = """date          entity             sum
0   0202004      Company 1          90
1   0202004      Company 2          80
2   0202004      Company 3          30
3   0202004      Company 4          40
"""

data = StringIO(input_string)
df2 = pd.read_csv(data, sep=r'\s\s ', engine='python')

After that I change the appearance of the first dataframe:

df1_nice_form = df1.set_index(['entity', 'status'])['sum'].unstack().fillna(0).astype(int)
print(df1_nice_form)

Output:

status      C   E   n
entity               
Company 1  70  19   0
Company 2  57  71   0
Company 3   0  16  16
Company 4   0   3   0

Then, I merge the two dataframes with each other:

df2 = df2.merge(df1_nice_form, how='left', on='entity')
print(df2)

Final output:

     date     entity  sum   C   E   n
0  202004  Company 1   90  70  19   0
1  202004  Company 2   80  57  71   0
2  202004  Company 3   30   0  16  16
3  202004  Company 4   40   0   3   0

EDIT: You can also use the pivot function to transform the first dataframe. This would be the code:

df1_nice_form = df1.pivot('entity','status','sum').fillna(0).astype(int)