Group and count total of blanks and total of rows in pandas dataframe-CodePudding

I have the following dataframe

  Country   Name  Code Signed  Index
0      CZ  Paulo     3      x   1
1      AE  Paulo   Yes   None   1
2      AE  Paulo   Yes   None   2
3      AE  Paulo     1    Yes   5
4      CZ  Paulo  None   None   6
5      DK  Paulo   Yes   None   9
6      DK  Paulo  None   None   20
7      PT  Paulo     2    Yes   20
8      PT  Paulo     1    Yes   22

I need three new columns after grouping by country

count the missing values in Code and Signed column
total of rows that have both values filled
total of rows that have the same Country value
point the rows where we have any of those values blank per Country (list or non list format) using the column "Index" as reference

If any of the Countries have all their Code and Signed rows filled, remove it from the dataframe.

In this case, it would return this dataframe:

  Country  Total_Blanks_on_Code  Total_Blanks_on_Signed Total_of_rows_with_both_values_filled  Total_of_rows_of_the_Country  Rows with any blank
0      CZ                     1                       1                                  None                             2                    6
1      AE                     2                       0                                     1                             3                   [1,2]
2      DK                     2                       1                                  None                             2                   [9,20]

Thank you for your help!

CodePudding user response：

Here's a way to do what your question asks:

df['both_filled'] = (df.Code.notna() & df.Signed.notna()).map({True:True, False:None})
df['Rows_with_any_blank'] = df.Index[df['both_filled'].isna()]
gb = df.groupby('Country', sort=False)
df2 = ( gb.count().assign(
    Rows_with_any_blank=gb['Rows_with_any_blank']
    .agg(lambda x: list(x.dropna().astype(int)))) )
df2 = ( df2.assign(
    Total_Blanks_on_Code=df2.Name - df2.Code,
    Total_Blanks_on_Signed=df2.Name - df2.Signed)
    [df2.both_filled < df2.Name]
    [['Total_Blanks_on_Code','Total_Blanks_on_Signed',
        'both_filled','Name','Rows_with_any_blank']]
    .reset_index()
    .rename(columns={
        'Name':'Total_of_rows_of_the_Country', 
        'both_filled':'Total_of_rows_with_both_values_filled'
    }) )

Input:

  Country   Name  Code Signed  Index
0      CZ  Paulo     3      x      1
1      AE  Paulo   Yes   None      1
2      AE  Paulo   Yes   None      2
3      AE  Paulo     1    Yes      5
4      CZ  Paulo  None   None      6
5      DK  Paulo   Yes   None      9
6      DK  Paulo  None   None     20
7      PT  Paulo     2    Yes     20
8      PT  Paulo     1    Yes     22

Output:

  Country  Total_Blanks_on_Code  Total_Blanks_on_Signed  Total_of_rows_with_both_values_filled  Total_of_rows_of_the_Country Rows_with_any_blank
0      CZ                     1                       1                                      1                             2                 [6]
1      AE                     0                       2                                      1                             3              [1, 2]
2      DK                     1                       2                                      0                             2             [9, 20]

Explanation:

Create both_filled column which is True if both Code and Signed are non-null and is None otherwise (this allows us to later use count() to effectively sum the number of rows having both columns non-null)
Create Rows_with_any_blank column which contains the value in Index for rows where neither of Code and Signed is null
Create a groubpy object gb by Country
Use count() to get the number of non-null entries per group in each column of gb
Use assign() to overwrite the Rows_with_any_blank column to be a list of the non-null Index values for each group
Use assign() to create and populate columns Total_Blanks_on_Code and Total_Blanks_on_Signed
Keep only rows where the count in both_filled < the count in Name (which is the total number of rows in the original df); this removes any Country for which all Code and Signed rows are filled
Select the 5 desired columns in the specified order using [[]]
Use reset_index() to switch Country from the index to a column
Use rename() to change Name and both_filled to have the specified labels Total_of_rows_of_the_Country and Total_of_rows_with_both_values_filled.

CodePudding user response：

Based on the definitions/conditions you gave, the country AE should have a total blanks on Code equal to 0 and not 2.

Anyway, you can use the code below to get the format of output you're looking for :

out = (df.assign(Total1 = df['Code'].isna(),
                 Total2 = df['Signed?'].isna(),
                 Total3 = ~df['Code'].isna() & ~df['Signed?'].isna())
        .groupby('Country', as_index=False)
        .agg(NumberOfCountries = ('Country','size'),
             Total1 = ('Total1','sum'),
             Total2 = ('Total2','sum'),
             Total3 = ('Total3','sum'))
        ).rename(columns={'Total1': 'Total Blanks on Code', 'Total2': 'Total Blanks on Signed?', 
                    'Total3': 'Total of rows with both values filled', 'NumberOfCountries': 'Total of rows of the Country'})

>>> print(out) :

>>> `print(out)` :