Create 3 dataframes from original dataframe using existing values-CodePudding

I have the following dataframe which I'm wanting to create 3 new dataframes from using the values in specific columns (ppbeid, initpen and incpen) and using the unique entries in the benid and id columns:

In part of my code, below, I'm using a list of unique items in the benid column, then removing any blanks from the list. This will give me some of the column headers I want in the new dataframes, but I also want the unique ids in the id column, the aforementioned list of unique benid items and, for a couple of the new dataframes, a total column too (see last 3 Excel screenshots):

lst_benids = df_ppbens.benid.unique()
lst_benids = list(filter(None, lst_benids))

# Result is: ['PENSION', 'POST', 'PRE8', 'SPOUSE', 'RULE29', 'MOD']

I know how to achieve this in Excel using Index/Match/Match, but it's long-winded and I really want to learn how to do this in Pandas. The output should be the following 3 dataframes (which I'll then export to Excel in different worksheets):

First dataframe should be what the ppbeid column entry is, for the corresponding benid field, listed by the unique ids:

The second dataframe should be the initpen figures for those unique ids are and the specific corresponding benid, with a total column at the end:

The third and final dataframe is the same as above but instead it's got the incpen column figures for corresponding benids and a total column at the end:

Any help is much appreciated and it will help me learn something I have to do manually in Excel a lot. Being new to Pandas/Python, I'm finding it confusing navigating the documents and other resources online. Thanks

CodePudding user response：

It seems like what you want can be done with the pivot method, which is similar to Excel's pivot table.

First let's set up the data:

df = pd.DataFrame(
    {
        "id": [92, 92, 133, 133, 133, 705, 705, 705, 588, 588],
        "initpen": [0] * 8   [606.32, 1559.39],
        "incpen": [963.18, 462, 886.08, 529.32, 609.6, 0, 0, 0, 624.52, 1635.8],
        "benid": ["PENSION", "POST", "PRE8", "PENSION", "POST", "POST", "PRE8", "PENSION", "POST", "PENSION",],
        # I got tired of typing out the whole numbers...
        "ppbeid": [6197, 6197, 61990, 61998, 61990, 828, 828, 828, 8289, 8289],
    }
)

Then you can simply do:

df1 = df.pivot(index='id', columns='benid', values='ppbeid')

And for the others, substitute the appropriate variable name for ppbeid.

The to add the total just do:

df1['Total'] = df1.sum(1)

CodePudding user response：

Alternate solution, although Josh's answer did point me in the right direction. Instead of using pivot, I'm using pd.pivot_table.

For the first dataframe that I needed, I used:

df1 = pd.pivot_table(df, index='pempid', columns='benid', values='ppbeid', dropna=False, fill_value='')

For the other two dataframes that I needed, I passed in additional parameters aggfunc (to sum up my rows), margins (to get a total column) and margins_name (the header for the total column). I did this separately for both initpen and incpen by changing the values parameter:

df_initpen = pd.pivot_table(df, index='pempid', columns='benid', values='initpen', dropna=False, fill_value=0, aggfunc='sum', margins=True, margins_name='Total')

This Python Pivot Tables Tutorial - YouTube video has more details on how to use pd.pivot_table and helped me arrive to this solution.