Home > Blockchain >  Create new column between two elements and between() python
Create new column between two elements and between() python

Time:10-27

I have a data frame data_set and two infos important in a list

cond_list = [{'LQ','DA'},{'HJ','OP'}].

Report all information between the two values LQ and DA OR HJ and OP in new column 'Another Pass' and which condition in the same day and same id.

Example:

Let i, j correspond to the Rank of LQ and DA respective. i<j

  • ID 1552, date 1/4/2020, we have Info['LQ'] correspond to Rank = 1 and Info['DA'] correspond to Rank = 4 --> So all information 'Another pass' include [LA, BA] because Rank DA < Rank LA, Rank BA < Rank LQ

  • ID 1552, date 5/4/2020, we have Info['LQ'] correspond to Rank = 3 and Info['DA'] correspond to Rank = 7 --> So all information 'Another pass' include [VT,AN,VB] because Rank DA < Rank VT, Rank AN, Rank VB < Rank LQ

  • ID 1697, date 15/4/2020, we have Info['LQ'] correspond to Rank = 1 and Info['DA'] correspond to Rank = 4 but there is no information in between these two points so 'Another pass' is empty

Input:

ID Date Rank Info Horaire Type Note
1552 1/4/2020 1 LQ 10:00 D LVM
1552 1/4/2020 1 LQ 10:10 A LVM
1552 1/4/2020 2 LA 10:12 P
1552 1/4/2020 3 BA 10:15 P
1552 1/4/2020 4 DA 10:25 A LVD
1552 5/4/2020 1 DT 11:30 D
1552 5/4/2020 2 GR 11:33 P
1552 5/4/2020 3 LQ 11:35 D LDT
1552 5/4/2020 3 LQ 11:38 A
1552 5/4/2020 4 VT 11:40 P
1552 5/4/2020 5 AN 11:43 P
1552 5/4/2020 6 VB 11:46 P
1552 5/4/2020 7 DA 11:55 A LDF
1552 5/4/2020 7 DA 11:59 D
1552 5/4/2020 8 AT 12:15 A
1697 15/4/2020 1 HJ 10:00 D LVM
1697 15/4/2020 4 OP 11:00 A LVM

i filtered a table of values containing only two condition elements:

mask = df.groupby(['ID', 'Date'])['Info'].agg(set).apply(lambda x: any([y.issubset(x) for y in cond_list]))
data_set = df.set_index(['ID', 'Date']).loc[mask].reset_index()
dt_final = data_set.loc[data_set['Info'].isin(cond.values.ravel())]

I gain a new data frame 'dt_final':

ID Date Rank Info Horaire Type Note
1552 1/4/2020 1 LQ 10:00 D LVM
1552 1/4/2020 1 LQ 10:10 A LVM
1552 1/4/2020 4 DA 10:25 A LVD
1552 5/4/2020 3 LQ 11:35 D LDT
1552 5/4/2020 3 LQ 11:38 A
1552 5/4/2020 7 DA 11:55 A LDF
1552 5/4/2020 7 DA 11:59 D
1697 15/4/2020 1 HJ 10:00 D LVM
1697 15/4/2020 4 OP 11:00 A LVM

I say an array of values 'Rank':

cd = dt_final.groupby(["ID", "Date"])["Rank"].agg(list).tolist()

i filter the values in between:

for i in cd:
    pr['Another Pass'] = data_set.loc[data_set.Rank.between(i[0],i[-1])].groupby(['ID', 'Date ']).agg({'Info':list})

Although it declares all information by the same date and same id, not the required values between two points in the condition cond_list and like a example :(

CodePudding user response:

Since the same ID and Date cannot have both sets of values, you can try:

cond_list = [{'LQ','DA'},{'HJ','OP'}]
firsts = [s[0] for s in cond_list]
lasts = [s[-1] for s in cond_list]

another_pass= df.groupby(["ID", "Date"])
                .apply(lambda x: x[x["Rank"].between(x[x["Info"].isin(firsts)]["Rank"].min(), 
                                                     x[x["Info"].isin(lasts)]["Rank"].max(), 
                                                     inclusive="neither")]["Info"].tolist())
                .rename("Another Pass")

>>> output

ID    Date     
1552  1/4/2020         [LA, BA]
      5/4/2020     [VT, AN, VB]
1697  15/4/2020              []

If you want to merge this back to your original DataFrame to create the new column:

output = df.merge(another_pass, left_on = ["ID", "Date"], right_index=True)

>>> output

      ID       Date  Rank Info Horaire Type Note  Another Pass
0   1552   1/4/2020     1   LQ   10:00    D  LVM      [LA, BA]
1   1552   1/4/2020     1   LQ   10:10    A  LVM      [LA, BA]
2   1552   1/4/2020     2   LA   10:12    P  NaN      [LA, BA]
3   1552   1/4/2020     3   BA   10:15    P  NaN      [LA, BA]
4   1552   1/4/2020     4   DA   10:25    A  LVD      [LA, BA]
5   1552   5/4/2020     1   DT   11:30    D  NaN  [VT, AN, VB]
6   1552   5/4/2020     2   GR   11:33    P  NaN  [VT, AN, VB]
7   1552   5/4/2020     3   LQ   11:35    D  LDT  [VT, AN, VB]
8   1552   5/4/2020     3   LQ   11:38    A  NaN  [VT, AN, VB]
9   1552   5/4/2020     4   VT   11:40    P  NaN  [VT, AN, VB]
10  1552   5/4/2020     5   AN   11:43    P  NaN  [VT, AN, VB]
11  1552   5/4/2020     6   VB   11:46    P  NaN  [VT, AN, VB]
12  1552   5/4/2020     7   DA   11:55    A  LDF  [VT, AN, VB]
13  1552   5/4/2020     7   DA   11:59    D  NaN  [VT, AN, VB]
14  1552   5/4/2020     8   AT   12:15    A  NaN  [VT, AN, VB]
15  1697  15/4/2020     1   HJ   10:00    D  LVM            []
16  1697  15/4/2020     4   OP   11:00    A  LVM            []
Edit:

For the required rows from the output, you can do:

>>> output[((output["Info"].isin(["LQ", "HJ"])) & 
            (output["Type"]=="D")) | 
           ((output["Info"].isin(["DA", "OP"])) & 
            (output["Type"]=="A"))]

      ID       Date  Rank Info Horaire Type Note  Another Pass
0   1552   1/4/2020     1   LQ   10:00    D  LVM      [LA, BA]
4   1552   1/4/2020     4   DA   10:25    A  LVD      [LA, BA]
7   1552   5/4/2020     3   LQ   11:35    D  LDT  [VT, AN, VB]
12  1552   5/4/2020     7   DA   11:55    A  LDF  [VT, AN, VB]
15  1697  15/4/2020     1   HJ   10:00    D  LVM            []
16  1697  15/4/2020     4   OP   11:00    A  LVM            []
  • Related