Combining Rows Based on Column Value


I have a sample that is similar to the problem I am running into. I have company names and revenue for three years, with each year's revenue coming from a different dataset. When I concatenate the data, it looks as follows:

  company_name  2020 Revenue  2021 Revenue  2022 Revenue
0    company_1          10.0           NaN           NaN
1    company_2          20.0           NaN           NaN
2    company_3          30.0           NaN           NaN
3    company_1           NaN          20.0           NaN
4    company_2           NaN          30.0           NaN
5    company_3           NaN          40.0           NaN
6    company_1           NaN           NaN          50.0
7    company_2           NaN           NaN          60.0
8    company_3           NaN           NaN          70.0
9    company_4           NaN           NaN          80.0
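
For reference, a concatenated frame like this can be built from three per-year frames roughly as follows (the frame names df_2020, df_2021, and df_2022 are hypothetical, not from the original post):

import pandas as pd

# Hypothetical per-year frames, each with a year-specific revenue column
df_2020 = pd.DataFrame({'company_name': ['company_1', 'company_2', 'company_3'],
                        '2020 Revenue': [10.0, 20.0, 30.0]})
df_2021 = pd.DataFrame({'company_name': ['company_1', 'company_2', 'company_3'],
                        '2021 Revenue': [20.0, 30.0, 40.0]})
df_2022 = pd.DataFrame({'company_name': ['company_1', 'company_2', 'company_3', 'company_4'],
                        '2022 Revenue': [50.0, 60.0, 70.0, 80.0]})

# Stacking the frames vertically leaves NaN in the columns each frame lacks
df = pd.concat([df_2020, df_2021, df_2022], ignore_index=True)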

What I am trying to do is end up with one row per company, followed by the actual revenue columns. In other words, I want to collapse the duplicate company_name rows so that each company's revenue values sit on a single row. My desired output:

  company_name  2020 Revenue  2021 Revenue  2022 Revenue
0    company_1            10            20            50
1    company_2            20            30            60
2    company_3            30            40            70
3    company_4             0             0            80

CodePudding user response:

Use melt and pivot_table:

# Melt to long form, drop the missing values, then pivot back to one row per company
out = (df.melt('company_name').dropna()
         .pivot_table('value', 'company_name', 'variable', fill_value=0)
         .rename_axis(columns=None).reset_index())
print(out)

# Output
  company_name  2020 Revenue  2021 Revenue  2022 Revenue
0    company_1            10            20            50
1    company_2            20            30            60
2    company_3            30            40            70
3    company_4             0             0            80

CodePudding user response:

You can try:

df.set_index('company_name').stack().unstack().reset_index()

Or

df.groupby('company_name', as_index=False).first()

Output:

  company_name  2020 Revenue  2021 Revenue  2022 Revenue
0    company_1          10.0          20.0          50.0
1    company_2          20.0          30.0          60.0
2    company_3          30.0          40.0          70.0
3    company_4           NaN           NaN          80.0
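
Note that both of these leave NaN where a company has no revenue in a given year. If you want zeros there, as in the desired output, you can chain a fillna, for example:

df.groupby('company_name', as_index=False).first().fillna(0)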

CodePudding user response:

I would say concat may not be the join you should be using here. Instead, try pd.merge, e.g. df_merge = pd.merge(df1, df2, how='inner', on='company_name'), and then repeat the merge with df_merge (your newly merged data) and the next dataframe. This keeps the rows aligned with each other and only adds the columns the frames do not share. If the frames have more than the two columns you are looking at, you may need to do a little more cleaning to get only the results you want, but this should get you started with your data in the right place.
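
A minimal sketch of that merge-based approach, assuming the three yearly frames are named df1, df2, and df3 (names not given in the original answer), and using an outer join so that company_4, which only appears in the 2022 data, is kept:

import pandas as pd

# Merge the first two yearly frames on the shared key column
df_merge = pd.merge(df1, df2, how='outer', on='company_name')

# Repeat with the third frame, then replace missing revenue with 0
df_merge = pd.merge(df_merge, df3, how='outer', on='company_name').fillna(0)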
