I have a table with multiple dates per record. Example of the table:
identifier | date | value |
---|---|---|
a | 1985-01-01 | ex1 |
a | 1985-01-01 | ex2 |
a | 1985-01-03 | ex3 |
b | 1990-01-05 | ex1 |
b | 1990-05-10 | ex4 |
c | 1987-01-01 | ex1 |
c | 1987-01-01 | ex3 |
d | 1986-01-01 | ex1 |
d | 1986-01-01 | ex3 |
I found out how to extract the earliest date in a group using:
df2 = df.loc[df.groupby('identifier')['date'].idxmin()]
However, when a group has two equal dates, idxmin returns the first of them. Because the value column is sorted alphabetically, I always end up with the alphabetically first value.
I would like to find a way to randomize this behavior whenever the dates are equal, in order to pick:
- the first value the 1st time
- the second value the 2nd time
- the third value (whenever present) the 3rd time
and then start over from the first value.
Is there a way to use the formula above together with a condition or a randomize method? How can I do that?
Expected output:
identifier | date | value |
---|---|---|
a | 1985-01-01 | ex1 |
b | 1990-01-05 | ex1 |
c | 1987-01-01 | ex3 |
d | 1986-01-01 | ex1 |
CodePudding user response:
You can filter the rows with the min date of each group, then sample a random one.
Either with two groupby calls:
out = (df[df['date'].eq(df.groupby('identifier')['date'].transform('min'))]
.groupby('identifier').sample(n=1)
)
Or with groupby.apply:
out = (df.groupby('identifier', group_keys=False)
.apply(lambda g: g[g['date'].eq(g['date'].min())].sample(n=1))
)
Example output:
identifier date value
1 a 1985-01-01 ex2
3 b 1990-01-05 ex1
6 c 1987-01-01 ex3
8 d 1986-01-01 ex3
Other possible output:
identifier date value
0 a 1985-01-01 ex1
3 b 1990-01-05 ex1
5 c 1987-01-01 ex1
7 d 1986-01-01 ex1
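For a run you can reproduce, here is a minimal self-contained sketch of the first variant above. The DataFrame is rebuilt from the question's table, and the seed passed to random_state is an arbitrary choice of mine (not part of the original answer); drop it to get a different draw on each run.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "identifier": list("aaabbccdd"),
    "date": pd.to_datetime([
        "1985-01-01", "1985-01-01", "1985-01-03",
        "1990-01-05", "1990-05-10",
        "1987-01-01", "1987-01-01",
        "1986-01-01", "1986-01-01",
    ]),
    "value": ["ex1", "ex2", "ex3", "ex1", "ex4", "ex1", "ex3", "ex1", "ex3"],
})

# Keep only the rows whose date equals the earliest date of their identifier.
min_date = df.groupby("identifier")["date"].transform("min")
earliest = df[df["date"].eq(min_date)]

# Draw one of the tied rows per identifier.
# random_state=42 is an assumed, arbitrary seed that makes the draw repeatable.
out = earliest.groupby("identifier").sample(n=1, random_state=42)
print(out)
```

Filtering first means sample only ever sees rows that share the earliest date, so identifiers without ties (such as b) always return their single earliest row.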
Iterating over all possibilities
If you want to go through the first row with the min date in each group, then the second, etc. (which is not random):
g1 = df.groupby('identifier')['date']
g2 = df.groupby(['identifier', 'date'])
df2 = (df[df['date'].eq(g1.transform('min'))]
.assign(n=g2.cumcount())
)
for n, g in df2.groupby('n'):
    print(f'iteration: {n+1}')
    print(g)
Output:
iteration: 1
identifier date value n
0 a 1985-01-01 ex1 0
3 b 1990-01-05 ex1 0
5 c 1987-01-01 ex1 0
7 d 1986-01-01 ex1 0
iteration: 2
identifier date value n
1 a 1985-01-01 ex2 1
6 c 1987-01-01 ex3 1
8 d 1986-01-01 ex3 1
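If "restart accordingly" in the question means that each identifier should wrap around its own ties (so b, which has no tie, is still returned every time), one possible sketch is to take the call number modulo the number of tied rows. The helper name pick_earliest and its call_no argument are hypothetical, introduced only for illustration, and the wrap-around reading is my assumption.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "identifier": list("aaabbccdd"),
    "date": pd.to_datetime([
        "1985-01-01", "1985-01-01", "1985-01-03",
        "1990-01-05", "1990-05-10",
        "1987-01-01", "1987-01-01",
        "1986-01-01", "1986-01-01",
    ]),
    "value": ["ex1", "ex2", "ex3", "ex1", "ex4", "ex1", "ex3", "ex1", "ex3"],
})


def pick_earliest(df, call_no):
    """Return one earliest-date row per identifier, cycling through ties.

    call_no is a 0-based counter kept by the caller (hypothetical helper).
    """
    min_date = df.groupby("identifier")["date"].transform("min")
    earliest = df[df["date"].eq(min_date)]
    grouped = earliest.groupby("identifier")
    rank = grouped.cumcount()                      # 0, 1, 2, ... among the tied rows
    n_ties = grouped["date"].transform("size")     # number of tied rows per identifier
    # Wrap the counter per identifier so groups with fewer ties start over.
    return earliest[rank.eq(call_no % n_ties)]


for call_no in range(3):
    print(f"call: {call_no + 1}")
    print(pick_earliest(df, call_no))
```

With the sample data, the first call returns rows 0, 3, 5, 7, the second returns rows 1, 3, 6, 8, and the third wraps back to the first selection. This is deterministic rather than random, like the loop above.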
CodePudding user response:
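Another possible solution, with pandas.DataFrame.groupby and pandas.DataFrame.sample. Note that random_state=0 fixes the seed, so the same rows are picked on every run; remove it to get a different draw each time: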
out = (df.groupby(["identifier", "date"], group_keys=False)
.apply(lambda s: s.sample(1, random_state=0))
.sort_values(["identifier", "date"])
.groupby("identifier", as_index=False).first()
)
Output:
print(out)
identifier date value
0 a 1985-01-01 ex2
1 b 1990-01-05 ex1
2 c 1987-01-01 ex3
3 d 1986-01-01 ex3