I have two dataframes, like below:

rows_df

  Id Location Age
0  0           30
1  1       US  20
2  2

requiredCols_df

  RequiredColumn
0       Location

requiredCols_df specifies which columns are required in rows_df. In this example, Location is required and Age is optional in rows_df.

I want to filter rows_df based on requiredCols_df, producing two resulting dataframes: one contains the rows where all required columns are filled, and the other contains the rows where at least one required column is empty.
Expected Result

Rows matched:

  Id Location Age
1  1       US  20

Rows not matched:

  Id Location Age
0  0           30
2  2
Please note that:

1. rows_df contains more than two columns, e.g. 10-30 columns.
2. requiredCols_df contains more than one row.
3. Location contains ' ' (one or more spaces) in row 0, and a null (None) in row 2.
import pandas as pd

rows_df = pd.DataFrame({'Id': ['0', '1', '2'],
                        'Location': [' ', 'US', None],
                        'Age': ['30', '20', '']})

# The column names below specify which columns in rows_df must be non-empty.
requiredCols_df = pd.DataFrame([['Location']],
                               columns=['RequiredColumn'])
I can do this with a loop, but I want to see if there is a better solution.
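For reference, the loop approach mentioned above might look roughly like this (a sketch; `is_filled` is a helper name I am introducing, not part of the data):

```python
import pandas as pd

rows_df = pd.DataFrame({'Id': ['0', '1', '2'],
                        'Location': [' ', 'US', None],
                        'Age': ['30', '20', '']})
requiredCols_df = pd.DataFrame([['Location']], columns=['RequiredColumn'])

# A value counts as "empty" if it is null, '', or whitespace-only.
def is_filled(value):
    return pd.notnull(value) and str(value).strip() != ''

# Keep a row only if every required column is filled.
keep = rows_df.apply(
    lambda row: all(is_filled(row[c]) for c in requiredCols_df['RequiredColumn']),
    axis=1)

matched = rows_df[keep]
not_matched = rows_df[~keep]
```

This iterates once per row via `apply`, which works but does not use pandas' vectorized operations.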
CodePudding user response:
If you first correct those placeholder values to be true NaNs...

req_cols = list(requiredCols_df.RequiredColumn)
# '' and whitespace-only strings become NaN; None is already treated as missing.
rows_df[req_cols] = rows_df[req_cols].replace(r'^\s*$', np.nan, regex=True)
  Id Location Age
0  0      NaN  30
1  1       US  20
2  2      NaN
Then this is simple:

matched = rows_df.dropna(subset=req_cols)
not_matched = rows_df.drop(matched.index)

print(matched)
print(not_matched)
# Output:
  Id Location Age
1  1       US  20

  Id Location Age
0  0      NaN  30
2  2      NaN
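End to end, starting from the original string-typed frame, the approach above can be sketched as follows (assuming the usual `numpy`/`pandas` imports):

```python
import numpy as np
import pandas as pd

rows_df = pd.DataFrame({'Id': ['0', '1', '2'],
                        'Location': [' ', 'US', None],
                        'Age': ['30', '20', '']})
requiredCols_df = pd.DataFrame([['Location']], columns=['RequiredColumn'])

req_cols = list(requiredCols_df.RequiredColumn)

# Normalize '' and whitespace-only strings in the required columns to NaN;
# None values are already treated as missing by dropna.
rows_df[req_cols] = rows_df[req_cols].replace(r'^\s*$', np.nan, regex=True)

matched = rows_df.dropna(subset=req_cols)
not_matched = rows_df.drop(matched.index)
```

Dropping `matched.index` from the original frame is a robust way to build the complement, since it sidesteps comparing NaN values row by row.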
CodePudding user response:
Assuming you don't have a tremendous number of columns, this seems like a task for a loop:

df_result = rows_df
for column in requiredCols_df["RequiredColumn"]:
    # notnull() rejects None/NaN; the strip check also rejects '' and whitespace-only values
    df_result = df_result[df_result[column].notnull() & df_result[column].str.strip().ne('')]
CodePudding user response:
With this setup:

import numpy as np
import pandas as pd

rows_df = pd.DataFrame(
    {
        "Id": [0, 1, 2],
        "Location": [np.nan, "US", np.nan],
        "Age": [30, 20, np.nan],
    }
)

requiredCols_df = pd.DataFrame(
    {
        "RequiredColumn": ["Location"]
    }
)
We can easily form a boolean series to select what we want:

required_column_not_null = rows_df[requiredCols_df["RequiredColumn"]].notnull().all(axis=1)

rows_df[required_column_not_null]

  Id Location   Age
1  1       US  20.0

required_column_null = ~required_column_not_null

rows_df[required_column_null]

  Id Location   Age
0  0      NaN  30.0
2  2      NaN   NaN
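Since the second mask is the negation of the first, the two selections partition rows_df exactly. A runnable sketch of this, using the setup above:

```python
import numpy as np
import pandas as pd

rows_df = pd.DataFrame({"Id": [0, 1, 2],
                        "Location": [np.nan, "US", np.nan],
                        "Age": [30, 20, np.nan]})
requiredCols_df = pd.DataFrame({"RequiredColumn": ["Location"]})

# True where every required column holds a non-null value.
required_column_not_null = rows_df[requiredCols_df["RequiredColumn"]].notnull().all(axis=1)

matched = rows_df[required_column_not_null]
not_matched = rows_df[~required_column_not_null]

# Every row lands in exactly one of the two frames.
assert len(matched) + len(not_matched) == len(rows_df)
```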
CodePudding user response:
filtered = rows_df
for column_name in requiredCols_df.RequiredColumn:
    # Filter cumulatively so that rows must satisfy every required column,
    # not just one of them. str.strip() != '' also catches multi-space values.
    filtered = filtered[(filtered[column_name].str.strip() != '') & filtered[column_name].notnull()]

filtered