Home > Back-end >  Merge rows dataframe based on two columns - Python
Merge rows dataframe based on two columns - Python

Time:07-28

I have a data frame df that has the following information:

|   Project | class   | x     |y    | z   |
| ---       | ---     | ---   | --- | --- | 
| Project_A | c.java  |a1     | a2  |     |
| Project_A | c.java  |       |     | a3  |
| Project_b | t.java  |b1     |b2   |     |

I need to combine these lines if the values of project and Class are equal among them.

Based on the previous example, the aspected output in this case have be:

| Project   | class  | x     |y    | z   |
| ---       | ---    | ---   | --- | --- |
| Project_A | c.java |a1     |a2   | a3  |
| Project_b | t.java |b1     |b2   |     |

it is also important to note that the way the dataset was constructed, there is no risk of rewriting values; so, in other words, you will never have such a situation:

| Project   | class  | x   |y   | z   |
| ---       | ---    | --- | ---| --- |
| Project_A | c.java |a1   | a2 |     |
| Project_A | c.java |a_x  | a_y|a_z  |
| Project_b | t.java |b1   |b2  |     |

How can do it?

CodePudding user response:

This will groupby the project and class and then find the first value of the grouping. I'm not sure if your data will allow for something like this since if there is another example of data in the column for the project/class combination it might mess up some of your data

df.groupby(['Project', 'class'], as_index=False).agg('first')

CodePudding user response:

You can use groupby() on both columns and sum the others:

X = pd.DataFrame({
        'Project':["Project_A","Project_A",'Project_b'],
        'class':["c.java","c.java","t.java"],
        'x':["a1",None,"b1"],
        'y':["a2",None,"b2"],
        'z':[None,"a3",None]})

Y= X.groupby(['Project','class']).sum()
print(Y)

Output:

                   x   y   z
Project   class
Project_A c.java  a1  a2  a3
Project_b t.java  b1  b2   0
  • Related