I have two dataframes, like below:

rows_df

  Id Location Age
0  0           30
1  1       US  20
2  2

requiredCols_df

  RequiredColumn
0       Location

requiredCols_df specifies which columns are required in rows_df. In this example, Location is required and Age is optional in rows_df.

I want to filter rows_df based on requiredCols_df, producing two resulting dataframes: one contains the rows where all required columns are filled, and the other contains the rows where at least one required column is empty.
Expected Result

Rows matched:

  Id Location Age
1  1       US  20

Rows not matched:

  Id Location Age
0  0           30
2  2
Please note that:

1. rows_df contains more than two columns, e.g. 10-30 columns.
2. requiredCols_df contains more than one row.
3. Location contains ' ' (one or more spaces) in row 0, and a null (None) in row 2.
import pandas as pd

rows_df = pd.DataFrame({'Id': ['0', '1', '2'],
                        'Location': [' ', 'US', None],
                        'Age': ['30', '20', '']})

# The column names below specify which columns in rows_df must be non-empty.
requiredCols_df = pd.DataFrame([['Location']],
                               columns=['RequiredColumn'])
I can do this with a loop, but I want to see if there is a better solution.
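For reference, the loop approach mentioned above might look roughly like this (a sketch; `is_filled` is a helper name I am introducing, not part of the data):

```python
import pandas as pd

rows_df = pd.DataFrame({'Id': ['0', '1', '2'],
                        'Location': [' ', 'US', None],
                        'Age': ['30', '20', '']})
requiredCols_df = pd.DataFrame([['Location']], columns=['RequiredColumn'])

# A value counts as "empty" if it is null, '', or whitespace-only.
def is_filled(value):
    return pd.notnull(value) and str(value).strip() != ''

# Keep a row only if every required column is filled.
keep = rows_df.apply(
    lambda row: all(is_filled(row[c]) for c in requiredCols_df['RequiredColumn']),
    axis=1)

matched = rows_df[keep]
not_matched = rows_df[~keep]
```

This iterates once per row via `apply`, which works but does not use pandas' vectorized operations.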
CodePudding user response:
If you first correct those placeholder values to be true NaNs...

req_cols = list(requiredCols_df.RequiredColumn)
# '' and whitespace-only strings become NaN; None is already treated as missing.
rows_df[req_cols] = rows_df[req_cols].replace(r'^\s*$', np.nan, regex=True)
  Id Location Age
0  0      NaN  30
1  1       US  20
2  2      NaN
Then this is simple:

matched = rows_df.dropna(subset=req_cols)
not_matched = rows_df.drop(matched.index)

print(matched)
print(not_matched)
# Output:
  Id Location Age
1  1       US  20

  Id Location Age
0  0      NaN  30
2  2      NaN
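End to end, starting from the original string-typed frame, the approach above can be sketched as follows (assuming the usual `numpy`/`pandas` imports):

```python
import numpy as np
import pandas as pd

rows_df = pd.DataFrame({'Id': ['0', '1', '2'],
                        'Location': [' ', 'US', None],
                        'Age': ['30', '20', '']})
requiredCols_df = pd.DataFrame([['Location']], columns=['RequiredColumn'])

req_cols = list(requiredCols_df.RequiredColumn)

# Normalize '' and whitespace-only strings in the required columns to NaN;
# None values are already treated as missing by dropna.
rows_df[req_cols] = rows_df[req_cols].replace(r'^\s*$', np.nan, regex=True)

matched = rows_df.dropna(subset=req_cols)
not_matched = rows_df.drop(matched.index)
```

Dropping `matched.index` from the original frame is a robust way to build the complement, since it sidesteps comparing NaN values row by row.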
CodePudding user response:
Assuming you don't have a tremendous number of columns, this seems like a task for a loop:

df_result = rows_df
for column in requiredCols_df["RequiredColumn"]:
    # notnull() rejects None/NaN; the strip check also rejects '' and whitespace-only values
    df_result = df_result[df_result[column].notnull() & df_result[column].str.strip().ne('')]
CodePudding user response:
With this setup:

import numpy as np
import pandas as pd

rows_df = pd.DataFrame(
    {
        "Id": [0, 1, 2],
        "Location": [np.nan, "US", np.nan],
        "Age": [30, 20, np.nan],
    }
)

requiredCols_df = pd.DataFrame(
    {
        "RequiredColumn": ["Location"]
    }
)
We can easily form a boolean series to select what we want:

required_column_not_null = rows_df[requiredCols_df["RequiredColumn"]].notnull().all(axis=1)

rows_df[required_column_not_null]

  Id Location   Age
1  1       US  20.0

required_column_null = ~required_column_not_null

rows_df[required_column_null]

  Id Location   Age
0  0      NaN  30.0
2  2      NaN   NaN
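Since the second mask is the negation of the first, the two selections partition rows_df exactly. A runnable sketch of this, using the setup above:

```python
import numpy as np
import pandas as pd

rows_df = pd.DataFrame({"Id": [0, 1, 2],
                        "Location": [np.nan, "US", np.nan],
                        "Age": [30, 20, np.nan]})
requiredCols_df = pd.DataFrame({"RequiredColumn": ["Location"]})

# True where every required column holds a non-null value.
required_column_not_null = rows_df[requiredCols_df["RequiredColumn"]].notnull().all(axis=1)

matched = rows_df[required_column_not_null]
not_matched = rows_df[~required_column_not_null]

# Every row lands in exactly one of the two frames.
assert len(matched) + len(not_matched) == len(rows_df)
```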
CodePudding user response:
filtered = rows_df
for column_name in requiredCols_df.RequiredColumn:
    # Filter cumulatively so that rows must satisfy every required column,
    # not just one of them. str.strip() != '' also catches multi-space values.
    filtered = filtered[(filtered[column_name].str.strip() != '') & filtered[column_name].notnull()]

filtered