Create new column between two elements and between() python-CodePudding

I have a data frame data_set and two infos important in a list

cond_list = [{'LQ','DA'},{'HJ','OP'}].

Report all information between the two values LQ and DA OR HJ and OP in new column 'Another Pass' and which condition in the same day and same id.

Example:

Let i, j correspond to the Rank of LQ and DA respective. i<j

ID 1552, date 1/4/2020, we have Info['LQ'] correspond to Rank = 1 and Info['DA'] correspond to Rank = 4 --> So all information 'Another pass' include [LA, BA] because Rank DA < Rank LA, Rank BA < Rank LQ
ID 1552, date 5/4/2020, we have Info['LQ'] correspond to Rank = 3 and Info['DA'] correspond to Rank = 7 --> So all information 'Another pass' include [VT,AN,VB] because Rank DA < Rank VT, Rank AN, Rank VB < Rank LQ
ID 1697, date 15/4/2020, we have Info['LQ'] correspond to Rank = 1 and Info['DA'] correspond to Rank = 4 but there is no information in between these two points so 'Another pass' is empty

Input:

ID	Date	Rank	Info	Horaire	Type	Note
1552	1/4/2020	1	LQ	10:00	D	LVM
1552	1/4/2020	1	LQ	10:10	A	LVM
1552	1/4/2020	2	LA	10:12	P
1552	1/4/2020	3	BA	10:15	P
1552	1/4/2020	4	DA	10:25	A	LVD
1552	5/4/2020	1	DT	11:30	D
1552	5/4/2020	2	GR	11:33	P
1552	5/4/2020	3	LQ	11:35	D	LDT
1552	5/4/2020	3	LQ	11:38	A
1552	5/4/2020	4	VT	11:40	P
1552	5/4/2020	5	AN	11:43	P
1552	5/4/2020	6	VB	11:46	P
1552	5/4/2020	7	DA	11:55	A	LDF
1552	5/4/2020	7	DA	11:59	D
1552	5/4/2020	8	AT	12:15	A
1697	15/4/2020	1	HJ	10:00	D	LVM
1697	15/4/2020	4	OP	11:00	A	LVM

i filtered a table of values containing only two condition elements:

mask = df.groupby(['ID', 'Date'])['Info'].agg(set).apply(lambda x: any([y.issubset(x) for y in cond_list]))
data_set = df.set_index(['ID', 'Date']).loc[mask].reset_index()
dt_final = data_set.loc[data_set['Info'].isin(cond.values.ravel())]

I gain a new data frame 'dt_final':

ID	Date	Rank	Info	Horaire	Type	Note
1552	1/4/2020	1	LQ	10:00	D	LVM
1552	1/4/2020	1	LQ	10:10	A	LVM
1552	1/4/2020	4	DA	10:25	A	LVD
1552	5/4/2020	3	LQ	11:35	D	LDT
1552	5/4/2020	3	LQ	11:38	A
1552	5/4/2020	7	DA	11:55	A	LDF
1552	5/4/2020	7	DA	11:59	D
1697	15/4/2020	1	HJ	10:00	D	LVM
1697	15/4/2020	4	OP	11:00	A	LVM

I say an array of values 'Rank':

cd = dt_final.groupby(["ID", "Date"])["Rank"].agg(list).tolist()

i filter the values in between:

for i in cd:
    pr['Another Pass'] = data_set.loc[data_set.Rank.between(i[0],i[-1])].groupby(['ID', 'Date ']).agg({'Info':list})

Although it declares all information by the same date and same id, not the required values between two points in the condition cond_list and like a example :(

CodePudding user response：

Since the same ID and Date cannot have both sets of values, you can try:

cond_list = [{'LQ','DA'},{'HJ','OP'}]
firsts = [s[0] for s in cond_list]
lasts = [s[-1] for s in cond_list]

another_pass= df.groupby(["ID", "Date"])
                .apply(lambda x: x[x["Rank"].between(x[x["Info"].isin(firsts)]["Rank"].min(), 
                                                     x[x["Info"].isin(lasts)]["Rank"].max(), 
                                                     inclusive="neither")]["Info"].tolist())
                .rename("Another Pass")

>>> output

ID    Date     
1552  1/4/2020         [LA, BA]
      5/4/2020     [VT, AN, VB]
1697  15/4/2020              []

If you want to merge this back to your original DataFrame to create the new column:

output = df.merge(another_pass, left_on = ["ID", "Date"], right_index=True)

>>> output

      ID       Date  Rank Info Horaire Type Note  Another Pass
0   1552   1/4/2020     1   LQ   10:00    D  LVM      [LA, BA]
1   1552   1/4/2020     1   LQ   10:10    A  LVM      [LA, BA]
2   1552   1/4/2020     2   LA   10:12    P  NaN      [LA, BA]
3   1552   1/4/2020     3   BA   10:15    P  NaN      [LA, BA]
4   1552   1/4/2020     4   DA   10:25    A  LVD      [LA, BA]
5   1552   5/4/2020     1   DT   11:30    D  NaN  [VT, AN, VB]
6   1552   5/4/2020     2   GR   11:33    P  NaN  [VT, AN, VB]
7   1552   5/4/2020     3   LQ   11:35    D  LDT  [VT, AN, VB]
8   1552   5/4/2020     3   LQ   11:38    A  NaN  [VT, AN, VB]
9   1552   5/4/2020     4   VT   11:40    P  NaN  [VT, AN, VB]
10  1552   5/4/2020     5   AN   11:43    P  NaN  [VT, AN, VB]
11  1552   5/4/2020     6   VB   11:46    P  NaN  [VT, AN, VB]
12  1552   5/4/2020     7   DA   11:55    A  LDF  [VT, AN, VB]
13  1552   5/4/2020     7   DA   11:59    D  NaN  [VT, AN, VB]
14  1552   5/4/2020     8   AT   12:15    A  NaN  [VT, AN, VB]
15  1697  15/4/2020     1   HJ   10:00    D  LVM            []
16  1697  15/4/2020     4   OP   11:00    A  LVM            []

Edit:

For the required rows from the output, you can do:

>>> output[((output["Info"].isin(["LQ", "HJ"])) & 
            (output["Type"]=="D")) | 
           ((output["Info"].isin(["DA", "OP"])) & 
            (output["Type"]=="A"))]

      ID       Date  Rank Info Horaire Type Note  Another Pass
0   1552   1/4/2020     1   LQ   10:00    D  LVM      [LA, BA]
4   1552   1/4/2020     4   DA   10:25    A  LVD      [LA, BA]
7   1552   5/4/2020     3   LQ   11:35    D  LDT  [VT, AN, VB]
12  1552   5/4/2020     7   DA   11:55    A  LDF  [VT, AN, VB]
15  1697  15/4/2020     1   HJ   10:00    D  LVM            []
16  1697  15/4/2020     4   OP   11:00    A  LVM            []