I have a dataframe df
:
Store | Date | Question | Answer |
---|---|---|---|
1 | 44725 | B | mango |
1 | 44725 | A | Yes |
1 | 44725 | E | Yes |
1 | 44725 | F | mango |
1 | 44625 | G | No |
2 | 44625 | A | Yes |
2 | 44625 | B | |
2 | 44625 | C | Yes |
2 | 44625 | D | apple |
2 | 44725 | A | No |
3 | 44725 | A | Yes |
3 | 44725 | C | Yes |
3 | 44725 | D | strawberry |
4 | 44726 | C | Yes |
4 | 44726 | A | Yes |
A,E,G,C are Questions and B,F,H,D are sub questions under Question
respectively.
For a store
for a Date
, If the Question is answered Yes
,I want to check if follow up /sub questions are present for their Questions, If not I want to get the store,Date,Question pairs as a pandas dataframe.
For example:
For store
2,Date
44625 Question
B is null/blank,need this in a separate dataframe like null_df
:
Store | Date | Question |
---|---|---|
2 | 44625 | B |
For store
3,Date
44725,Question
A does not have the sub question so want this in a separate dataframe df_check
:
Store | Date | Question |
---|---|---|
3 | 44725 | A |
4 | 44726 | C |
4 | 44726 | A |
For store
1 for Date
44725 All the questions have sub questions and answers.
How do I get the above 2 dataframes?
CodePudding user response:
import pandas as pd
import numpy as np
df = pd.read_csv('df.csv', header=0)
aaa = {'A': 'B', 'E': 'F', 'G': 'H', 'C': 'D'}
null_df = pd.DataFrame()
df_check = pd.DataFrame()
def my_func(x):
ind = x[x['Answer'] == 'Yes'].index
if len(ind) > 0:
bbb = np.array([[aaa[x.loc[i, 'Question']], x.loc[i, 'Question']] for i in ind])
fff = x[x['Question'].isin(bbb[:, 0])]
nu = fff[pd.isnull(fff['Answer'])]
if len(nu) > 0:
global null_df
null_df = pd.concat([null_df, nu[['Store', 'Date', 'Question']]], ignore_index=True)
check_ind = np.in1d(bbb[:, 0], x['Question'])
check_ans = bbb[~ check_ind][:, 1]
check = x[x['Question'].isin(check_ans)]
if len(check) > 0:
global df_check
df_check = pd.concat([df_check, check[['Store', 'Date', 'Question']]], ignore_index=True)
df.groupby(['Store', 'Date']).apply(my_func)
print(null_df)
print(df_check)
Output
Store Date Question
0 2 44625 B
Store Date Question
0 3 44725 A
1 4 44726 C
2 4 44726 A
A dictionary aaa
is created in which the keys are questions
, and the values are sub-questions
.
The dataframe is grouped by the 'Store', 'Date' columns. Next, apply
with a custom function(my_func
).
Rows where x['Answer'] == 'Yes'
are selected to get indexes
. If there are indexes, then we fall under the condition.
A list of bbb
sub-questions is created on the left, questions
on the right (sub-questions
are obtained through the aaa
dictionary).
null_df:
The fff
list gets subquestions. With the isin function, which returns a boolean mask. In nu we get rows where Answer is NaN.
df_check:
check_ind uses np.in1d to check if the selected sub-questions match the real ones. check_ans get a list of questions without sub-questions(apply tilde ~
get mismatched indices).check we select lines with the necessary questions using isin in which the list of questions is substituted.
pd.concat is used to connect both dataframes.
If something is not clear, you can print
the data of each line of code and see the result.