I have a dataframe that uses a DateTime index. Due to the nature of the data, there are duplicate rows in the raw data that produce duplicate indices. I would like to add a random number of microseconds to each index to make them unique.
So, my index is currently:
2021-06-01 08:58:47+00:00
and I want it to be:
2021-06-01 08:58:47.011356+00:00
I have a column named TimeUTC which I used to create the index. I tried adding a random number of microseconds to it:
df.index = df.index + datetime.timedelta(microseconds=random.randrange(0, 1000000, 1))
but that added the same number of microseconds to all rows. I had the same result when trying to use apply lambda.
- How can I add a random number of microseconds to each row's TimeUTC?
- Can I do this directly to the index instead of updating TimeUTC and then reindexing?
Thank you.
CodePudding user response:
You need to use pd.to_timedelta with unit='us' for microseconds, together with an array of random numbers that has the same length as the dataframe.
import numpy as np
import pandas as pd

# dummy data
df = pd.DataFrame({
    'a': range(10),
    'TimeUTC': pd.to_datetime(['2021-10-01 00:00:00']*3 + ['2021-10-01 00:00:01']*5
                              + ['2021-10-01 00:00:02']*2, utc=True)}
)
# set seed for reproducibility
np.random.seed(10)
# set the index from TimeUTC and add microseconds with to_timedelta
df.index = df['TimeUTC'] + pd.to_timedelta(np.random.randint(1,1000000,len(df)),unit='us')
print(df)
                                  a                   TimeUTC
2021-10-01 00:00:00.345354+00:00  0 2021-10-01 00:00:00+00:00
2021-10-01 00:00:00.760958+00:00  1 2021-10-01 00:00:00+00:00
2021-10-01 00:00:00.881168+00:00  2 2021-10-01 00:00:00+00:00
2021-10-01 00:00:01.443713+00:00  3 2021-10-01 00:00:01+00:00
2021-10-01 00:00:01.617842+00:00  4 2021-10-01 00:00:01+00:00
2021-10-01 00:00:01.105596+00:00  5 2021-10-01 00:00:01+00:00
2021-10-01 00:00:01.533661+00:00  6 2021-10-01 00:00:01+00:00
2021-10-01 00:00:01.927706+00:00  7 2021-10-01 00:00:01+00:00
2021-10-01 00:00:02.299742+00:00  8 2021-10-01 00:00:02+00:00
2021-10-01 00:00:02.804337+00:00  9 2021-10-01 00:00:02+00:00
Note that adding random numbers may not preserve the time order of your dataframe: in this case, for the row with a=5, the datetime index is lower than in the previous two rows.
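If the order matters, one possible alternative is to drop the randomness and offset each duplicate by its position within its group of identical timestamps. The following is only a minimal sketch, assuming the rows are already in the order you want to keep and that a deterministic offset is acceptable in place of a random one. It also works directly on the index, without going through the TimeUTC column. The name dup_rank is made up for this illustration.

```python
import pandas as pd

# dummy data with the duplicated timestamps already set as the index
idx = pd.to_datetime(
    ['2021-10-01 00:00:00'] * 3
    + ['2021-10-01 00:00:01'] * 5
    + ['2021-10-01 00:00:02'] * 2,
    utc=True,
)
df = pd.DataFrame({'a': range(10)}, index=idx)

# position of each row within its group of identical timestamps: 0, 1, 2, ...
# (dup_rank is a hypothetical name used only in this sketch)
dup_rank = df.groupby(level=0).cumcount()

# add that many microseconds directly to the existing index
df.index = df.index + pd.to_timedelta(dup_rank.to_numpy(), unit='us')

print(df.index.is_unique, df.index.is_monotonic_increasing)
```

Because each offset is unique within its group and far smaller than one second, the index becomes unique and stays sorted as long as no timestamp has a million or more duplicates. Random offsets, by contrast, can occasionally collide, so it may be worth checking df.index.is_unique after using them.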