I have a data frame data_set
and two infos important in a list
cond_list = [{'LQ','DA'},{'HJ','OP'}]
.
Report all information between the two values LQ and DA
OR HJ and OP
in new column 'Another Pass' and which condition in the same day and same id.
Example:
Let i, j correspond to the Rank of LQ and DA respective. i<j
ID 1552, date 1/4/2020, we have Info['LQ'] correspond to Rank = 1 and Info['DA'] correspond to Rank = 4 --> So all information 'Another pass' include [LA, BA] because Rank DA < Rank LA, Rank BA < Rank LQ
ID 1552, date 5/4/2020, we have Info['LQ'] correspond to Rank = 3 and Info['DA'] correspond to Rank = 7 --> So all information 'Another pass' include [VT,AN,VB] because Rank DA < Rank VT, Rank AN, Rank VB < Rank LQ
ID 1697, date 15/4/2020, we have Info['LQ'] correspond to Rank = 1 and Info['DA'] correspond to Rank = 4 but there is no information in between these two points so 'Another pass' is empty
Input:
ID | Date | Rank | Info | Horaire | Type | Note |
---|---|---|---|---|---|---|
1552 | 1/4/2020 | 1 | LQ | 10:00 | D | LVM |
1552 | 1/4/2020 | 1 | LQ | 10:10 | A | LVM |
1552 | 1/4/2020 | 2 | LA | 10:12 | P | |
1552 | 1/4/2020 | 3 | BA | 10:15 | P | |
1552 | 1/4/2020 | 4 | DA | 10:25 | A | LVD |
1552 | 5/4/2020 | 1 | DT | 11:30 | D | |
1552 | 5/4/2020 | 2 | GR | 11:33 | P | |
1552 | 5/4/2020 | 3 | LQ | 11:35 | D | LDT |
1552 | 5/4/2020 | 3 | LQ | 11:38 | A | |
1552 | 5/4/2020 | 4 | VT | 11:40 | P | |
1552 | 5/4/2020 | 5 | AN | 11:43 | P | |
1552 | 5/4/2020 | 6 | VB | 11:46 | P | |
1552 | 5/4/2020 | 7 | DA | 11:55 | A | LDF |
1552 | 5/4/2020 | 7 | DA | 11:59 | D | |
1552 | 5/4/2020 | 8 | AT | 12:15 | A | |
1697 | 15/4/2020 | 1 | HJ | 10:00 | D | LVM |
1697 | 15/4/2020 | 4 | OP | 11:00 | A | LVM |
i filtered a table of values containing only two condition elements:
mask = df.groupby(['ID', 'Date'])['Info'].agg(set).apply(lambda x: any([y.issubset(x) for y in cond_list]))
data_set = df.set_index(['ID', 'Date']).loc[mask].reset_index()
dt_final = data_set.loc[data_set['Info'].isin(cond.values.ravel())]
I gain a new data frame 'dt_final':
ID | Date | Rank | Info | Horaire | Type | Note |
---|---|---|---|---|---|---|
1552 | 1/4/2020 | 1 | LQ | 10:00 | D | LVM |
1552 | 1/4/2020 | 1 | LQ | 10:10 | A | LVM |
1552 | 1/4/2020 | 4 | DA | 10:25 | A | LVD |
1552 | 5/4/2020 | 3 | LQ | 11:35 | D | LDT |
1552 | 5/4/2020 | 3 | LQ | 11:38 | A | |
1552 | 5/4/2020 | 7 | DA | 11:55 | A | LDF |
1552 | 5/4/2020 | 7 | DA | 11:59 | D | |
1697 | 15/4/2020 | 1 | HJ | 10:00 | D | LVM |
1697 | 15/4/2020 | 4 | OP | 11:00 | A | LVM |
I say an array of values 'Rank':
cd = dt_final.groupby(["ID", "Date"])["Rank"].agg(list).tolist()
i filter the values in between:
for i in cd:
pr['Another Pass'] = data_set.loc[data_set.Rank.between(i[0],i[-1])].groupby(['ID', 'Date ']).agg({'Info':list})
Although it declares all information by the same date and same id, not the required values between two points in the condition cond_list and like a example :(
CodePudding user response:
Since the same ID and Date cannot have both sets of values, you can try:
cond_list = [{'LQ','DA'},{'HJ','OP'}]
firsts = [s[0] for s in cond_list]
lasts = [s[-1] for s in cond_list]
another_pass= df.groupby(["ID", "Date"])
.apply(lambda x: x[x["Rank"].between(x[x["Info"].isin(firsts)]["Rank"].min(),
x[x["Info"].isin(lasts)]["Rank"].max(),
inclusive="neither")]["Info"].tolist())
.rename("Another Pass")
>>> output
ID Date
1552 1/4/2020 [LA, BA]
5/4/2020 [VT, AN, VB]
1697 15/4/2020 []
If you want to merge this back to your original DataFrame to create the new column:
output = df.merge(another_pass, left_on = ["ID", "Date"], right_index=True)
>>> output
ID Date Rank Info Horaire Type Note Another Pass
0 1552 1/4/2020 1 LQ 10:00 D LVM [LA, BA]
1 1552 1/4/2020 1 LQ 10:10 A LVM [LA, BA]
2 1552 1/4/2020 2 LA 10:12 P NaN [LA, BA]
3 1552 1/4/2020 3 BA 10:15 P NaN [LA, BA]
4 1552 1/4/2020 4 DA 10:25 A LVD [LA, BA]
5 1552 5/4/2020 1 DT 11:30 D NaN [VT, AN, VB]
6 1552 5/4/2020 2 GR 11:33 P NaN [VT, AN, VB]
7 1552 5/4/2020 3 LQ 11:35 D LDT [VT, AN, VB]
8 1552 5/4/2020 3 LQ 11:38 A NaN [VT, AN, VB]
9 1552 5/4/2020 4 VT 11:40 P NaN [VT, AN, VB]
10 1552 5/4/2020 5 AN 11:43 P NaN [VT, AN, VB]
11 1552 5/4/2020 6 VB 11:46 P NaN [VT, AN, VB]
12 1552 5/4/2020 7 DA 11:55 A LDF [VT, AN, VB]
13 1552 5/4/2020 7 DA 11:59 D NaN [VT, AN, VB]
14 1552 5/4/2020 8 AT 12:15 A NaN [VT, AN, VB]
15 1697 15/4/2020 1 HJ 10:00 D LVM []
16 1697 15/4/2020 4 OP 11:00 A LVM []
Edit:
For the required rows from the output, you can do:
>>> output[((output["Info"].isin(["LQ", "HJ"])) &
(output["Type"]=="D")) |
((output["Info"].isin(["DA", "OP"])) &
(output["Type"]=="A"))]
ID Date Rank Info Horaire Type Note Another Pass
0 1552 1/4/2020 1 LQ 10:00 D LVM [LA, BA]
4 1552 1/4/2020 4 DA 10:25 A LVD [LA, BA]
7 1552 5/4/2020 3 LQ 11:35 D LDT [VT, AN, VB]
12 1552 5/4/2020 7 DA 11:55 A LDF [VT, AN, VB]
15 1697 15/4/2020 1 HJ 10:00 D LVM []
16 1697 15/4/2020 4 OP 11:00 A LVM []