Home > Software design >  Python Pandas: Filter in dataframe matching a set of multiple items / conditions from another datafr
Python Pandas: Filter in dataframe matching a set of multiple items / conditions from another datafr

Time:10-07

I have a dataframe y:

Table 1: Main dataframe y

Numero Date location
1111 01/11/2009 0065;686006;AN
1111 01/11/2009 0065;686006;P6
1111 05/11/2009 0065;686675;1P
1111 05/11/2009 0065;688575;RE
1111 06/11/2009 0065;686675;1P
1111 05/11/2009 0065;686006;AN
5654 15/11/2009 0065;686675;1P
5654 15/11/2009 0065;565845;2K
5654 26/11/2009 0065;154224;5O
4581 05/12/2009 0065;686006;AN

Table 2: Conditions dataframe cond

Info 1 Info 2
686006;AN 686006;P6
565845;2K 686675;1P

So I want to create a new table with the same numero in same date pass through two points "686006;AN and 686006;P6" or "565845;2K and 686675;1P".

Desired output:

Numero Date location
1111 01/11/2009 0065;686006;AN
1111 01/11/2009 0065;686006;P6
5654 15/11/2009 0065;686675;1P
5654 15/11/2009 0065;565845;2K
dr = y.groupby('Numero')
my_set = []

for a, b in itertools.combinations(y['Date'], 2):
    if op.eq(a,b) == True:
        for elem, group in dr:
            for j in range(len(cond)):
                if group['location'].str.contains(cond['Info 1'][j]).any() and group['location'].str.contains(cond['Info 2'][j]).any():
                    my_set.append(elem)

                
my_train
data_= y.loc[y['Numero'].isin(my_train)]
print(data_)     

But it doesn't work because large data so out memories and it doesn't print the date and only the value containing the info but all the values of that element.

THANK YOU SO MUCH!!!!!

CodePudding user response:

You can approach by the following steps:

Step 1: Create a helper column to extract from location the substring in same format as Info in cond

y['loc_info'] = y['location'].str.split(';', n=1).str[-1]

Step 2: Create a list of 2 points from cond

cond_list = cond.apply(set, axis=1).to_list()

Step 3: boolean mask for same 'Numero' in same 'Date' pass through 2 points

mask = y.groupby(['Numero', 'Date'])['loc_info'].agg(set).apply(lambda x: any([y.issubset(x) for y in cond_list]))

Step 4: locate entries fulfilling boolean mask condition

df_out = y.set_index(['Numero', 'Date']).loc[mask].reset_index()

Full set of codes:

# Create a helper column to extract from `location` the substring in same format as `Info` in `cond`
y['loc_info'] = y['location'].str.split(';', n=1).str[-1]

# create a list of 2 points from `cond`
cond_list = cond.apply(set, axis=1).to_list()

# boolean mask for same 'Numero' in same 'Date' pass through 2 points
mask = y.groupby(['Numero', 'Date'])['loc_info'].agg(set).apply(lambda x: any([y.issubset(x) for y in cond_list]))

# locate entries fulfilling boolean mask condition
df_out = y.set_index(['Numero', 'Date']).loc[mask].reset_index()

Result

print(df_out)

   Numero        Date        location   loc_info
0    1111  01/11/2009  0065;686006;AN  686006;AN
1    1111  01/11/2009  0065;686006;P6  686006;P6
2    5654  15/11/2009  0065;686675;1P  686675;1P
3    5654  15/11/2009  0065;565845;2K  565845;2K

You can drop the helper column loc_info, as follows:

df_out = df_out.drop('loc_info', axis=1)

Result:

print(df_out)

   Numero        Date        location
0    1111  01/11/2009  0065;686006;AN
1    1111  01/11/2009  0065;686006;P6
2    5654  15/11/2009  0065;686675;1P
3    5654  15/11/2009  0065;565845;2K
  • Related