Convert DataFrameGroupBy to DataFrame (keeping values, no counting/summing etc.)

I am struggling with a problem and was hoping to get some help.

I have a DF that I want to group before merging it with another DF. The only problem is that after the groupby I can't merge it, because the result is a 'DataFrameGroupBy' object.

The steps would be as follows:

Group the DF to consolidate the information by the two IDs:

dftest = cov_DF.groupby(['lstID','covID'])

which results in the DF going from 121 rows to 101. Then merge that on 'lstID' with a different DF.

The answers I'm seeing on here are all related to users summing/counting/maxing and this doesn't apply to me.

To explain more of my situation: I'm iterating through XML, appending certain values, and building a DF from them. Now I want to group that DF by the columns above so that there are no duplicate lstID/covID combinations and the remaining columns contain what I need.

An example of the table initially looks like this:

lstID | covID | covPrem | covBase | CovValue
1     | 1     | 10      | NA      | NA
1     | 1     | NA      | 2       | NA

And so I want that to turn into the below table, before I merge.

lstID | covID | covPrem | covBase | CovValue
1     | 1     | 10      | 2       | NA

Should I be using a different function? I feel like groupby is working how I want it to, but it's annoying that the result is not a DF, so I can't merge until I change it back.

There will always be a value in lstID and covID.

CodePudding user response：

You can combine groupby with ffill and bfill to achieve this. Within each group, they fill NaN values with the nearest non-NaN value from a previous row (ffill) or a following row (bfill).

Code:

import pandas as pd

df = pd.DataFrame({
    "lstID": [1, 1],
    "covID": [1, 1],
    "covprem": [10, pd.NA],
    "covbase": [pd.NA, 2],
    "corval": [pd.NA, pd.NA]
})

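# bfill is grouped as well; a plain .bfill() would pull values in from the next group
df.set_index(["lstID", "covID"]).groupby(level=[0, 1]).ffill().groupby(level=[0, 1]).bfill().groupby(level=[0, 1]).first().reset_index()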

Alternative:

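Use drop_duplicates instead of the final groupby(...).first(). Both fills are applied per group, so every row in a group ends up identical and one row per group can be kept.

df.set_index(["lstID", "covID"]).groupby(level=[0, 1]).ffill().groupby(level=[0, 1]).bfill().reset_index().drop_duplicates(subset=["lstID", "covID"])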

Output:

    lstID  covID  covprem  covbase  corval
0   1      1      10       2        <NA>
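
Once the frame has more than one lstID/covID pair, both fills need to stay inside each group. Here is a minimal sketch of that, assuming made-up data with a second group (lstID 2) that is not part of the question, and plain NaN instead of pd.NA:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups, with values scattered across the rows of each group.
df = pd.DataFrame({
    "lstID":   [1, 1, 2, 2],
    "covID":   [1, 1, 1, 1],
    "covprem": [10.0, np.nan, np.nan, 30.0],
    "covbase": [np.nan, 2.0, 4.0, np.nan],
    "corval":  [np.nan, np.nan, np.nan, 5.0],
})

keys = ["lstID", "covID"]
indexed = df.set_index(keys)

# Fill forward, then backward, grouping both times so that nothing
# crosses from one lstID/covID pair into another.
filled = indexed.groupby(level=keys).ffill().groupby(level=keys).bfill()

# All rows of a group are now identical, so keep one row per group.
result = filled.reset_index().drop_duplicates(subset=keys).reset_index(drop=True)
print(result)
```

Group 1 has no corval at all, so it should stay NaN there. If the backward fill were not grouped, the 5.0 from group 2 would be copied into group 1.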

CodePudding user response：

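When you group a DataFrame, you get a DataFrameGroupBy object rather than a DataFrame, and once you aggregate it, the group keys become the index. Apply an aggregate function first, then reset the index so the keys become ordinary columns again, and then you can merge with other DataFrames. Also keep in mind that the aggregation may change the column names. This solves your problem with the DataFrameGroupBy object: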

a = df.groupby(...).aggregate(...).reset_index()
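
As one possible concrete version of that pattern for this question, first() could serve as the aggregate, since by default it picks the first non-null value of each column within a group. This is only a sketch: other_DF and its lstName column are made up for illustration, and the sample rows are not the asker's real data.

```python
import numpy as np
import pandas as pd

# Sample rows shaped like the table in the question, plus one extra group.
cov_DF = pd.DataFrame({
    "lstID":   [1, 1, 2],
    "covID":   [1, 1, 1],
    "covPrem": [10.0, np.nan, 7.0],
    "covBase": [np.nan, 2.0, np.nan],
})

# Hypothetical second frame to merge with; not from the question.
other_DF = pd.DataFrame({"lstID": [1, 2], "lstName": ["A", "B"]})

# first() keeps the first non-null value per column in each group;
# reset_index() turns the group keys back into ordinary columns.
consolidated = cov_DF.groupby(["lstID", "covID"]).first().reset_index()

# The result is a regular DataFrame, so it can be merged as usual.
merged = consolidated.merge(other_DF, on="lstID", how="left")
print(merged)
```

If a group holds several different non-null values in the same column, first() silently keeps only the earliest one, so this fits only when each value appears at most once per lstID/covID pair.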