Choosing the earliest date per record when equal dates are present


I have a table with multiple dates per record. Example of the table:

identifier  date        value
a           1985-01-01  ex1
a           1985-01-01  ex2
a           1985-01-03  ex3
b           1990-01-05  ex1
b           1990-05-10  ex4
c           1987-01-01  ex1
c           1987-01-01  ex3
d           1986-01-01  ex1
d           1986-01-01  ex3

I found out how to extract the earliest date in a group using:

df2 = df.loc[df.groupby('identifier')['date'].idxmin()]

However, when two dates are equal, idxmin returns the first matching row, and since the value column is sorted alphabetically I always end up choosing the first alphabetical value.
I would like to randomize this behavior whenever the dates are tied, in order to pick:

  • the first value the 1st time
  • the second value the 2nd time
  • the third value (whenever present) the 3rd time

and then restart the cycle accordingly.

Is there a way to combine the formula above with a condition or a randomization method? How can I do that?

Expected output:

identifier  date        value
a           1985-01-01  ex1
b           1990-01-05  ex1
c           1987-01-01  ex3
d           1986-01-01  ex1
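For reproduction, a sketch that rebuilds the example table (dates parsed as datetimes so idxmin compares them chronologically, not as strings):

```python
import pandas as pd

# Rebuild the example table from the question
df = pd.DataFrame({
    'identifier': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
    'date': pd.to_datetime([
        '1985-01-01', '1985-01-01', '1985-01-03',
        '1990-01-05', '1990-05-10',
        '1987-01-01', '1987-01-01',
        '1986-01-01', '1986-01-01',
    ]),
    'value': ['ex1', 'ex2', 'ex3', 'ex1', 'ex4', 'ex1', 'ex3', 'ex1', 'ex3'],
})

# The original extraction: idxmin keeps only the FIRST tied row per group
df2 = df.loc[df.groupby('identifier')['date'].idxmin()]
```

Running this shows the problem described above: df2 contains 'ex1' for every identifier, never the later tied values.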

CodePudding user response:

You can filter the rows with the minimum date, then sample a random one.

Either with two groupby:

out = (df[df['date'].eq(df.groupby('identifier')['date'].transform('min'))]
        .groupby('identifier').sample(n=1)
      )
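Note that sample draws a different row on each run; if you need reproducible picks (e.g. for tests), sample accepts a random_state. A minimal sketch, using a small hypothetical frame with two tied groups:

```python
import pandas as pd

# Hypothetical frame: both identifiers have two rows tied on the min date
df = pd.DataFrame({
    'identifier': ['a', 'a', 'c', 'c'],
    'date': pd.to_datetime(['1985-01-01', '1985-01-01',
                            '1987-01-01', '1987-01-01']),
    'value': ['ex1', 'ex2', 'ex1', 'ex3'],
})

# Keep rows tied with the per-group minimum date, then draw one per group;
# random_state pins the draw so reruns give the same result
out = (df[df['date'].eq(df.groupby('identifier')['date'].transform('min'))]
        .groupby('identifier').sample(n=1, random_state=0)
      )
```

Rerunning the pipeline with the same random_state returns the same rows, which would not hold without it.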

Or with groupby.apply:

out = (df.groupby('identifier', group_keys=False)
         .apply(lambda g: g[g['date'].eq(g['date'].min())].sample(n=1))
      )

Example output:

  identifier       date value
1          a 1985-01-01   ex2
3          b 1990-01-05   ex1
6          c 1987-01-01   ex3
8          d 1986-01-01   ex3

Other possible output:

  identifier       date value
0          a 1985-01-01   ex1
3          b 1990-01-05   ex1
5          c 1987-01-01   ex1
7          d 1986-01-01   ex1
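A close variant of the same idea without apply, sketched here: shuffle the whole frame, sort by date with a stable sort (so ties keep their shuffled order), and keep the first row per identifier, which is then a random row among the min-date ties:

```python
import pandas as pd

# Small hypothetical frame: 'a' has tied min dates, 'b' does not
df = pd.DataFrame({
    'identifier': ['a', 'a', 'b', 'b'],
    'date': pd.to_datetime(['1985-01-01', '1985-01-01',
                            '1990-01-05', '1990-05-10']),
    'value': ['ex1', 'ex2', 'ex1', 'ex4'],
})

out = (df.sample(frac=1)                        # shuffle all rows
         .sort_values('date', kind='mergesort') # stable sort keeps shuffled tie order
         .drop_duplicates('identifier')         # first row per group = random min-date row
         .sort_values('identifier')
      )
```

For identifier 'b' the result is always the 1990-01-05 row; for 'a' it is a random one of the two tied rows.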

Iterating over all possibilities

If you want to iterate over the ties in order (first the first minimum, then the second, etc., which is not random):

g1 = df.groupby('identifier')['date']
g2 = df.groupby(['identifier', 'date'])

df2 = (df[df['date'].eq(g1.transform('min'))]
       .assign(n=g2.cumcount())
      )

for n, g in df2.groupby('n'):
    print(f'iteration: {n + 1}')
    print(g)

Output:

iteration: 1
  identifier       date value  n
0          a 1985-01-01   ex1  0
3          b 1990-01-05   ex1  0
5          c 1987-01-01   ex1  0
7          d 1986-01-01   ex1  0
iteration: 2
  identifier       date value  n
1          a 1985-01-01   ex2  1
6          c 1987-01-01   ex3  1
8          d 1986-01-01   ex3  1
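The "restart accordingly" cycling from the question can be sketched on top of this enumeration: keep a call counter and, on the k-th call, take the tied row whose cumcount equals k modulo the number of ties in each group (the names ties, sizes, and pick are illustrative helpers, not pandas API):

```python
import pandas as pd

df = pd.DataFrame({
    'identifier': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
    'date': pd.to_datetime([
        '1985-01-01', '1985-01-01', '1985-01-03',
        '1990-01-05', '1990-05-10',
        '1987-01-01', '1987-01-01',
        '1986-01-01', '1986-01-01',
    ]),
    'value': ['ex1', 'ex2', 'ex3', 'ex1', 'ex4', 'ex1', 'ex3', 'ex1', 'ex3'],
})

# Rows tied with each group's minimum date, numbered 0, 1, ... within the tie
ties = df[df['date'].eq(df.groupby('identifier')['date'].transform('min'))].copy()
ties['n'] = ties.groupby('identifier').cumcount()
sizes = ties.groupby('identifier')['n'].transform('max') + 1  # tie count per row

def pick(k):
    """k-th call: the (k mod tie-count)-th tied row of every identifier."""
    return ties[ties['n'].eq(k % sizes)].drop(columns='n')
```

pick(0) returns the first tied value per group, pick(1) the second (groups with a single candidate repeat it), and pick(2) wraps around to the first again.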

CodePudding user response:

Another possible solution, with pandas.DataFrame.groupby and pandas.DataFrame.sample:

out = (df.groupby(["identifier", "date"], group_keys=False)
         .apply(lambda s: s.sample(1, random_state=0))
         .sort_values(["identifier", "date"])
         .groupby("identifier", as_index=False).first()
      )

Output:

print(out)

  identifier        date value
0          a  1985-01-01   ex2
1          b  1990-01-05   ex1
2          c  1987-01-01   ex3
3          d  1986-01-01   ex3