Condionally Filtering Dataframe using Pandas-CodePudding

I have a dataframe df:

Store	Date	Question	Answer
1	44725	B	mango
1	44725	A	Yes
1	44725	E	Yes
1	44725	F	mango
1	44625	G	No
2	44625	A	Yes
2	44625	B
2	44625	C	Yes
2	44625	D	apple
2	44725	A	No
3	44725	A	Yes
3	44725	C	Yes
3	44725	D	strawberry
4	44726	C	Yes
4	44726	A	Yes

A,E,G,C are Questions and B,F,H,D are sub questions under Question respectively. For a store for a Date, If the Question is answered Yes,I want to check if follow up /sub questions are present for their Questions, If not I want to get the store,Date,Question pairs as a pandas dataframe. For example:

For store 2,Date 44625 Question B is null/blank,need this in a separate dataframe like null_df:

Store	Date	Question
2	44625	B

For store 3,Date 44725,Question A does not have the sub question so want this in a separate dataframe df_check:

Store	Date	Question
3	44725	A
4	44726	C
4	44726	A

For store 1 for Date 44725 All the questions have sub questions and answers. How do I get the above 2 dataframes?

CodePudding user response：

import pandas as pd
import numpy as np

df = pd.read_csv('df.csv', header=0)

aaa = {'A': 'B', 'E': 'F', 'G': 'H', 'C': 'D'}


null_df = pd.DataFrame()
df_check = pd.DataFrame()


def my_func(x):
    ind = x[x['Answer'] == 'Yes'].index
    if len(ind) > 0:
        bbb = np.array([[aaa[x.loc[i, 'Question']], x.loc[i, 'Question']] for i in ind])

        fff = x[x['Question'].isin(bbb[:, 0])]
        nu = fff[pd.isnull(fff['Answer'])]
        if len(nu) > 0:
            global null_df
            null_df = pd.concat([null_df, nu[['Store', 'Date', 'Question']]], ignore_index=True)

        check_ind = np.in1d(bbb[:, 0], x['Question'])
        check_ans = bbb[~ check_ind][:, 1]
        check = x[x['Question'].isin(check_ans)]
        if len(check) > 0:
            global df_check
            df_check = pd.concat([df_check, check[['Store', 'Date', 'Question']]], ignore_index=True)


df.groupby(['Store', 'Date']).apply(my_func)

print(null_df)
print(df_check)

Output

   Store   Date Question
0      2  44625        B


   Store   Date Question
0      3  44725        A
1      4  44726        C
2      4  44726        A

A dictionary aaa is created in which the keys are questions, and the values are sub-questions.

The dataframe is grouped by the 'Store', 'Date' columns. Next, apply with a custom function(my_func). Rows where x['Answer'] == 'Yes' are selected to get indexes. If there are indexes, then we fall under the condition.

A list of bbb sub-questions is created on the left, questions on the right (sub-questions are obtained through the aaa dictionary).

null_df: The fff list gets subquestions. With the isin function, which returns a boolean mask. In nu we get rows where Answer is NaN.

df_check: check_ind uses np.in1d to check if the selected sub-questions match the real ones. check_ans get a list of questions without sub-questions(apply tilde ~ get mismatched indices).check we select lines with the necessary questions using isin in which the list of questions is substituted.

pd.concat is used to connect both dataframes.

If something is not clear, you can print the data of each line of code and see the result.