I have a dataframe y
:
Table 1: Main dataframe y
Numero | Date | location |
---|---|---|
1111 | 01/11/2009 | 0065;686006;AN |
1111 | 01/11/2009 | 0065;686006;P6 |
1111 | 05/11/2009 | 0065;686675;1P |
1111 | 05/11/2009 | 0065;688575;RE |
1111 | 06/11/2009 | 0065;686675;1P |
1111 | 05/11/2009 | 0065;686006;AN |
5654 | 15/11/2009 | 0065;686675;1P |
5654 | 15/11/2009 | 0065;565845;2K |
5654 | 26/11/2009 | 0065;154224;5O |
4581 | 05/12/2009 | 0065;686006;AN |
Table 2: Conditions dataframe cond
Info 1 | Info 2 |
---|---|
686006;AN | 686006;P6 |
565845;2K | 686675;1P |
So I want to create a new table with the same numero in same date pass through two points "686006;AN and 686006;P6"
or "565845;2K and 686675;1P"
.
Desired output:
Numero | Date | location |
---|---|---|
1111 | 01/11/2009 | 0065;686006;AN |
1111 | 01/11/2009 | 0065;686006;P6 |
5654 | 15/11/2009 | 0065;686675;1P |
5654 | 15/11/2009 | 0065;565845;2K |
dr = y.groupby('Numero')
my_set = []
for a, b in itertools.combinations(y['Date'], 2):
if op.eq(a,b) == True:
for elem, group in dr:
for j in range(len(cond)):
if group['location'].str.contains(cond['Info 1'][j]).any() and group['location'].str.contains(cond['Info 2'][j]).any():
my_set.append(elem)
my_train
data_= y.loc[y['Numero'].isin(my_train)]
print(data_)
But it doesn't work because large data so out memories and it doesn't print the date and only the value containing the info but all the values of that element.
THANK YOU SO MUCH!!!!!
CodePudding user response:
You can approach by the following steps:
Step 1: Create a helper column to extract from location
the substring in same format as Info
in cond
y['loc_info'] = y['location'].str.split(';', n=1).str[-1]
Step 2: Create a list of 2 points from cond
cond_list = cond.apply(set, axis=1).to_list()
Step 3: boolean mask for same 'Numero' in same 'Date' pass through 2 points
mask = y.groupby(['Numero', 'Date'])['loc_info'].agg(set).apply(lambda x: any([y.issubset(x) for y in cond_list]))
Step 4: locate entries fulfilling boolean mask condition
df_out = y.set_index(['Numero', 'Date']).loc[mask].reset_index()
Full set of codes:
# Create a helper column to extract from `location` the substring in same format as `Info` in `cond`
y['loc_info'] = y['location'].str.split(';', n=1).str[-1]
# create a list of 2 points from `cond`
cond_list = cond.apply(set, axis=1).to_list()
# boolean mask for same 'Numero' in same 'Date' pass through 2 points
mask = y.groupby(['Numero', 'Date'])['loc_info'].agg(set).apply(lambda x: any([y.issubset(x) for y in cond_list]))
# locate entries fulfilling boolean mask condition
df_out = y.set_index(['Numero', 'Date']).loc[mask].reset_index()
Result
print(df_out)
Numero Date location loc_info
0 1111 01/11/2009 0065;686006;AN 686006;AN
1 1111 01/11/2009 0065;686006;P6 686006;P6
2 5654 15/11/2009 0065;686675;1P 686675;1P
3 5654 15/11/2009 0065;565845;2K 565845;2K
You can drop the helper column loc_info
, as follows:
df_out = df_out.drop('loc_info', axis=1)
Result:
print(df_out)
Numero Date location
0 1111 01/11/2009 0065;686006;AN
1 1111 01/11/2009 0065;686006;P6
2 5654 15/11/2009 0065;686675;1P
3 5654 15/11/2009 0065;565845;2K