I have an xlsx file containing a lot of data. The data has duplicate values in a column named UniversalIDS, which I want to replace with randomly generated IDs using Pandas.
So far I've tried several approaches I found by googling, but none of them worked. For example, I tried this:
import pandas as pd
import uuid
df = pd.read_excel('C:/Users/Nervous/Downloads/ss.xlsx')
df.loc[df.duplicated(subset=["UniversalIDS"]), "UniversalIDS"] = uuid.uuid4()
df.to_excel("C:/Users/Nervous/Downloads/tt.xlsx")
print("done!")
I also tried other alternatives I've seen on this site, for example:
df2 = df.assign(UniversalIDS=df['UniversalIDS'].where(
    ~df.duplicated(['UniversalIDS']), uuid.uuid4()))
This didn't work either:
df.loc[df["UniversalIDS"].duplicated(), "test"]
Here is a snippet of the xlsx data:
UniversalIDS
f6112cd7-0868-4cc9-b5ab-d7381cc23bdf
f3e75641-f328-429f-ae32-41399d8a0bf0
08dccc5c-5774-4614-925c-ad9373821075
79a8ebed-154c-47c7-b16d-cbba5d8469eb
396f8e63-1950-4c36-9524-c1dec8728ffd
62cba3bd-a855-4141-8460-7ff838ecea62
b7f4f753-b479-413a-abcd-34d08436eb85
c0fd6e61-edb1-4dce-94ac-7d0529860e1f
c42f8c98-c285-4552-af9f-2f4d8e89b9e8
8cb77021-eb4f-4cfa-a4a3-e649f6a53c03
cb7f4b8d-976a-4481-919e-c8ff7d384cc6
e15fd2bb-5409-4a8b-9fdc-bf1e5862db58
27b97893-aae7-4a9a-aae1-0fc21a099209
1abc2c2f-94f2-4546-b912-b85bc4ed0cb8
6bf264fb-1b82-48e3-966a-14e48a61a63e
9653faeb-7b3d-408e-93e3-bc729f259c75
a09f3eb6-0059-4a77-bf2f-4f7436508ba8
65e06948-2e6c-413f-a768-c3faf8108a6c
291ff491-4ff0-4fb2-b095-b3e66f2d7ca0
653535c7-0389-4077-8e72-3835fbd72d4d
61408fc8-4f45-48e0-b83a-40b6bfd76ad5
3ae8d547-bf4b-42ac-b083-a1662f1a5c82
4955c673-c5da-464c-8e14-a897df0774eb
a39bad90-5235-4679-945e-534bb47b8347
264a1f6e-adf4-45a7-b1b1-e6f3fc073447
a855025b-ee84-46d5-aedb-cbac9a5b1920
71b16a5b-3f6d-4d30-8a65-203959fe87a2
4f3f86f2-4e61-475a-bc1d-eb2112f23953
59da45de-c192-4885-8a55-9138ca49b33a
8f41df73-d9dc-4663-9f64-d090d7c5ca77
84f7103f-e9de-444f-b046-c02d75af0ed1
2738f733-7438-494c-9368-5fb700df93d1
777a3cd7-19ae-4181-b91d-9be8eaf30523
b6083731-a43e-4b5a-ac9a-94a3202103e7
f22873c1-6811-4025-8f0d-47d72d49e499
f262c369-f44a-4b90-8219-d29b33bc14e8
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
ea0d26f5-d8c4-4082-983a-8eeea29c6c54
As can be seen above, there are duplicate values in the UniversalIDS column. It's also worth mentioning that the data contains other columns, but for simplicity I've kept only the column causing the problem. So my question is: how can I replace the duplicate values in the UniversalIDS column with new unique IDs?
CodePudding user response:
Your expression:
df.loc[df.duplicated(subset=["UniversalIDS"]), "UniversalIDS"] = uuid.uuid4()
is valid Python, but uuid.uuid4() is evaluated only once, so the same UUID is assigned to every duplicated element, which means the elements will still be duplicates after it has executed. You should generate a distinct UUID for each duplicated row instead:
# Compute the mask once, then assign one fresh UUID per duplicated row.
# str() is used because Excel writers can't serialize UUID objects.
mask = df.duplicated(subset=["UniversalIDS"])
df.loc[mask, "UniversalIDS"] = [str(uuid.uuid4()) for _ in range(mask.sum())]
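As a quick check that it worked (assuming df is the frame loaded in the question), the column should now contain no duplicates at all:

print(df["UniversalIDS"].is_unique)            # True
print(df["UniversalIDS"].duplicated().sum())   # 0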
CodePudding user response:
When I reproduced your problem I reached essentially the same conclusion as the previous answer, but I came up with a slightly different solution, so I'm posting it for reference:
# Iterate over only the duplicated rows and give each one a fresh UUID.
for idx, _ in df[df.duplicated(subset=["UniversalIDS"])].iterrows():
    df.at[idx, 'UniversalIDS'] = uuid.uuid4()
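For reference, here is a minimal end-to-end sketch based on the code in the question (the file paths are the ones from the question; adjust them for your machine). Note that .iterrows() is fine for small frames, but the previous answer's vectorized assignment will be faster on large ones, so the sketch uses that. The UUIDs are converted to strings so they can be written back to Excel:

import pandas as pd
import uuid

df = pd.read_excel('C:/Users/Nervous/Downloads/ss.xlsx')

# Replace every duplicated UniversalIDS value (keeping the first
# occurrence of each) with a freshly generated UUID string.
mask = df.duplicated(subset=["UniversalIDS"])
df.loc[mask, "UniversalIDS"] = [str(uuid.uuid4()) for _ in range(mask.sum())]

df.to_excel("C:/Users/Nervous/Downloads/tt.xlsx", index=False)
print("done!")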