I have the following dataframe
Country Name Code Signed Index
0 CZ Paulo 3 x 1
1 AE Paulo Yes None 1
2 AE Paulo Yes None 2
3 AE Paulo 1 Yes 5
4 CZ Paulo None None 6
5 DK Paulo Yes None 9
6 DK Paulo None None 20
7 PT Paulo 2 Yes 20
8 PT Paulo 1 Yes 22
I need three new columns after grouping by country
- count the missing values in
Code
andSigned
column - total of rows that have both values filled
- total of rows that have the same Country value
- point the rows where we have any of those values blank per Country (list or non list format) using the column "Index" as reference
If any of the Countries have all their Code
and Signed
rows filled, remove it from the dataframe.
In this case, it would return this dataframe:
Country Total_Blanks_on_Code Total_Blanks_on_Signed Total_of_rows_with_both_values_filled Total_of_rows_of_the_Country Rows with any blank
0 CZ 1 1 None 2 6
1 AE 2 0 1 3 [1,2]
2 DK 2 1 None 2 [9,20]
Thank you for your help!
CodePudding user response:
Here's a way to do what your question asks:
df['both_filled'] = (df.Code.notna() & df.Signed.notna()).map({True:True, False:None})
df['Rows_with_any_blank'] = df.Index[df['both_filled'].isna()]
gb = df.groupby('Country', sort=False)
df2 = ( gb.count().assign(
Rows_with_any_blank=gb['Rows_with_any_blank']
.agg(lambda x: list(x.dropna().astype(int)))) )
df2 = ( df2.assign(
Total_Blanks_on_Code=df2.Name - df2.Code,
Total_Blanks_on_Signed=df2.Name - df2.Signed)
[df2.both_filled < df2.Name]
[['Total_Blanks_on_Code','Total_Blanks_on_Signed',
'both_filled','Name','Rows_with_any_blank']]
.reset_index()
.rename(columns={
'Name':'Total_of_rows_of_the_Country',
'both_filled':'Total_of_rows_with_both_values_filled'
}) )
Input:
Country Name Code Signed Index
0 CZ Paulo 3 x 1
1 AE Paulo Yes None 1
2 AE Paulo Yes None 2
3 AE Paulo 1 Yes 5
4 CZ Paulo None None 6
5 DK Paulo Yes None 9
6 DK Paulo None None 20
7 PT Paulo 2 Yes 20
8 PT Paulo 1 Yes 22
Output:
Country Total_Blanks_on_Code Total_Blanks_on_Signed Total_of_rows_with_both_values_filled Total_of_rows_of_the_Country Rows_with_any_blank
0 CZ 1 1 1 2 [6]
1 AE 0 2 1 3 [1, 2]
2 DK 1 2 0 2 [9, 20]
Explanation:
- Create
both_filled
column which isTrue
if bothCode
andSigned
are non-null and isNone
otherwise (this allows us to later usecount()
to effectively sum the number of rows having both columns non-null) - Create
Rows_with_any_blank
column which contains the value inIndex
for rows where neither ofCode
andSigned
is null - Create a
groubpy
objectgb
byCountry
- Use
count()
to get the number of non-null entries per group in each column ofgb
- Use
assign()
to overwrite theRows_with_any_blank
column to be a list of the non-nullIndex
values for each group - Use
assign()
to create and populate columnsTotal_Blanks_on_Code
andTotal_Blanks_on_Signed
- Keep only rows where the count in
both_filled
< the count inName
(which is the total number of rows in the original df); this removes anyCountry
for which allCode
andSigned
rows are filled - Select the 5 desired columns in the specified order using
[[]]
- Use
reset_index()
to switchCountry
from the index to a column - Use
rename()
to changeName
andboth_filled
to have the specified labelsTotal_of_rows_of_the_Country
andTotal_of_rows_with_both_values_filled
.
CodePudding user response:
Based on the definitions/conditions you gave, the country AE should have a total blanks on Code equal to 0 and not 2.
Anyway, you can use the code below to get the format of output you're looking for :
out = (df.assign(Total1 = df['Code'].isna(),
Total2 = df['Signed?'].isna(),
Total3 = ~df['Code'].isna() & ~df['Signed?'].isna())
.groupby('Country', as_index=False)
.agg(NumberOfCountries = ('Country','size'),
Total1 = ('Total1','sum'),
Total2 = ('Total2','sum'),
Total3 = ('Total3','sum'))
).rename(columns={'Total1': 'Total Blanks on Code', 'Total2': 'Total Blanks on Signed?',
'Total3': 'Total of rows with both values filled', 'NumberOfCountries': 'Total of rows of the Country'})