Python Pandas: Filter in dataframe matching a set of multiple items / conditions from another datafr-CodePudding

I have a dataframe y:

Table 1: Main dataframe y

Numero	Date	location
1111	01/11/2009	0065;686006;AN
1111	01/11/2009	0065;686006;P6
1111	05/11/2009	0065;686675;1P
1111	05/11/2009	0065;688575;RE
1111	06/11/2009	0065;686675;1P
1111	05/11/2009	0065;686006;AN
5654	15/11/2009	0065;686675;1P
5654	15/11/2009	0065;565845;2K
5654	26/11/2009	0065;154224;5O
4581	05/12/2009	0065;686006;AN

Table 2: Conditions dataframe cond

Info 1	Info 2
686006;AN	686006;P6
565845;2K	686675;1P

So I want to create a new table with the same numero in same date pass through two points "686006;AN and 686006;P6" or "565845;2K and 686675;1P".

Desired output:

Numero	Date	location
1111	01/11/2009	0065;686006;AN
1111	01/11/2009	0065;686006;P6
5654	15/11/2009	0065;686675;1P
5654	15/11/2009	0065;565845;2K

dr = y.groupby('Numero')
my_set = []

for a, b in itertools.combinations(y['Date'], 2):
    if op.eq(a,b) == True:
        for elem, group in dr:
            for j in range(len(cond)):
                if group['location'].str.contains(cond['Info 1'][j]).any() and group['location'].str.contains(cond['Info 2'][j]).any():
                    my_set.append(elem)

                
my_train
data_= y.loc[y['Numero'].isin(my_train)]
print(data_)

But it doesn't work because large data so out memories and it doesn't print the date and only the value containing the info but all the values of that element.

THANK YOU SO MUCH!!!!!

CodePudding user response：

You can approach by the following steps:

Step 1: Create a helper column to extract from location the substring in same format as Info in cond

y['loc_info'] = y['location'].str.split(';', n=1).str[-1]

Step 2: Create a list of 2 points from cond

cond_list = cond.apply(set, axis=1).to_list()

Step 3: boolean mask for same 'Numero' in same 'Date' pass through 2 points

mask = y.groupby(['Numero', 'Date'])['loc_info'].agg(set).apply(lambda x: any([y.issubset(x) for y in cond_list]))

Step 4: locate entries fulfilling boolean mask condition

df_out = y.set_index(['Numero', 'Date']).loc[mask].reset_index()

Full set of codes:

# Create a helper column to extract from `location` the substring in same format as `Info` in `cond`
y['loc_info'] = y['location'].str.split(';', n=1).str[-1]

# create a list of 2 points from `cond`
cond_list = cond.apply(set, axis=1).to_list()

# boolean mask for same 'Numero' in same 'Date' pass through 2 points
mask = y.groupby(['Numero', 'Date'])['loc_info'].agg(set).apply(lambda x: any([y.issubset(x) for y in cond_list]))

# locate entries fulfilling boolean mask condition
df_out = y.set_index(['Numero', 'Date']).loc[mask].reset_index()

Result

print(df_out)

   Numero        Date        location   loc_info
0    1111  01/11/2009  0065;686006;AN  686006;AN
1    1111  01/11/2009  0065;686006;P6  686006;P6
2    5654  15/11/2009  0065;686675;1P  686675;1P
3    5654  15/11/2009  0065;565845;2K  565845;2K

You can drop the helper column loc_info, as follows:

df_out = df_out.drop('loc_info', axis=1)

Result:

print(df_out)

   Numero        Date        location
0    1111  01/11/2009  0065;686006;AN
1    1111  01/11/2009  0065;686006;P6
2    5654  15/11/2009  0065;686675;1P
3    5654  15/11/2009  0065;565845;2K