I have the following issue: I have been asked to write a Python script to list every pair of duplicate names.
The problem is that only part of the string is the same; the last part is a number (the deployment time), for example:
asg-lc-crl-tst-turfpari-rtl20220124153420214800000001
asg-lc-crl-tst-turfpari-rtl20220330150836189100000001
Let's say I have a list with these 9 values:
(0) -- asg-lc-crl-tst-turfpari-rtl20220124153420214800000001 <--- duplicate with (1)
(1) -- asg-lc-crl-tst-turfpari-rtl20220330150836189100000001 <--- duplicate with (0)
(2) -- asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001 <--- duplicate with (4)
(3) -- asg-lc-crl-di1-ledger-manager-rtl20220414144111344500000001
(4) -- asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001 <--- duplicate with (2)
(5) -- asg-lc-crl-tst-art-manager-rtl20220124162240173500000001 <--- duplicate with (6)
(6) -- asg-lc-crl-tst-art-manager-rtl20220330150933020900000001 <--- duplicate with (5)
(7) -- asg-lc-bck-ope-backoh-oh20201021134525920100000001
(8) -- asg-lc-bck-ope-springbootadmin-oh20201021134526042200000002
I have written this code, but it is not working properly:
def list_duplicate_asg(asg1, asg2):
    if asg1.rpartition('-')[0] == asg2.rpartition('-')[0]:
        suffix1 = asg1.rpartition('-')[2]
        suffix2 = asg2.rpartition('-')[2]
        if suffix1[0:3] == suffix2[0:3]:
            print('\n ========== Duplicate exists =========: \n')
            print(asg1, asg2, '\n ============================ \n')
As you can see, if the values follow each other in the list, they get printed:
- 0 & 1: they get printed
- 5 & 6: they get printed
- But, for example, (2) & (4) don't get printed...
I don't know whether my parsing method is efficient or whether there is a much better one.
And how can I improve my code so that it detects duplicates even when they are not in order?
I want the result to be like this:
Duplicates : asg-lc-crl-tst-turfpari-rtl20220124153420214800000001,asg-lc-crl-tst-turfpari-rtl20220330150836189100000001
Duplicates : asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001,asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001
Duplicates : asg-lc-crl-tst-art-manager-rtl20220124162240173500000001,asg-lc-crl-tst-art-manager-rtl20220330150933020900000001
Answer:
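The data shown appears to be made up of at least 3 whitespace-delimited tokens per line, of which the 3rd (the name) is the one of interest. The part that varies, a short suffix such as rtl followed by the timestamp, begins after the last hyphen, so everything before that hyphen can serve as the key. Therefore: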
asgs = ['asg-lc-crl-tst-turfpari-rtl20220124153420214800000001',
'asg-lc-crl-tst-turfpari-rtl20220330150836189100000001',
'asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001',
'asg-lc-crl-di1-ledger-manager-rtl20220414144111344500000001',
'asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001',
'asg-lc-crl-tst-art-manager-rtl20220124162240173500000001',
'asg-lc-crl-tst-art-manager-rtl20220330150933020900000001',
'asg-lc-bck-ope-backoh-oh20201021134525920100000001',
'asg-lc-bck-ope-springbootadmin-oh20201021134526042200000002']
counter = dict()
for asg in asgs:
    # key: everything before the last hyphen
    k = asg[:asg.rfind('-')]
    counter[k] = counter.setdefault(k, 0) + 1
    # print each key once, the second time it is seen
    if counter[k] == 2:
        print(k)
Obviously this will need to be adapted according to how the data are actually made available to your program. The output from this will be:
asg-lc-crl-tst-turfpari
asg-lc-dpr-dev1-app_hode
asg-lc-crl-tst-art-manager
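The loop above reports only the shared prefix. To list the full names instead, one line per group in the style requested in the question, one option is to collect the names under each prefix. This is a minimal sketch, assuming the names are already in a Python list and that every name contains at least one hyphen; `group_by_prefix` is a hypothetical helper name, not part of the original answer.

```python
from collections import defaultdict


def group_by_prefix(names):
    """Group names by everything before their last hyphen (hypothetical helper)."""
    groups = defaultdict(list)
    for name in names:
        # rpartition splits on the LAST hyphen; [0] is the part before it.
        prefix = name.rpartition('-')[0]
        groups[prefix].append(name)
    # Keep only the prefixes shared by at least two names.
    return [group for group in groups.values() if len(group) > 1]


# Three of the names from the question, deliberately not adjacent.
sample = [
    'asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001',
    'asg-lc-crl-di1-ledger-manager-rtl20220414144111344500000001',
    'asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001',
]
for group in group_by_prefix(sample):
    print('Duplicates : ' + ','.join(group))
```

Because every name is filed under its prefix in a single pass, the order of the list no longer matters, and a prefix that occurs three or more times still ends up on one line.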
Answer:
This should work. The strings first get stripped of their timestamp suffix and then registered in a dict (we call that cleaned string "key" for now). The dict keeps track of all the keys that have been found so far. When a key is already known, a duplicate dictionary is filled. The duplicates dict has a list of all duplicates for each key.
import re
asgs = ['asg-lc-crl-tst-turfpari-rtl20220124153420214800000001',
'asg-lc-crl-tst-turfpari-rtl20220330150836189100000001',
'asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001',
'asg-lc-crl-di1-ledger-manager-rtl20220414144111344500000001',
'asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001',
'asg-lc-crl-tst-art-manager-rtl20220124162240173500000001',
'asg-lc-crl-tst-art-manager-rtl20220330150933020900000001',
'asg-lc-bck-ope-backoh-oh20201021134525920100000001',
'asg-lc-bck-ope-springbootadmin-oh20201021134526042200000002']
def get_duplicate_asgs(asgs: list):
    asgs_found = {}   # key -> first name seen with that key
    duplicates = {}   # key -> all names sharing that key
    for asg in asgs:
        # strip the fixed-length, 26-digit timestamp
        asg_cleaned = asg[0:-26]
        # alternative solution for time stamps of different length:
        # asg_cleaned = re.sub("[0-9]+$", "", asg)
        if asg_cleaned in asgs_found:
            if asg_cleaned in duplicates:
                duplicates[asg_cleaned].append(asg)
            else:
                duplicates[asg_cleaned] = [asgs_found[asg_cleaned], asg]
        else:
            asgs_found[asg_cleaned] = asg
    return list(duplicates.values())
print(get_duplicate_asgs(asgs))
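If the timestamps can differ in length, the regex alternative from the comment can serve as the key, and `itertools.combinations` yields every pair when more than two names share a key. The following is only a sketch under those assumptions: `duplicate_pairs` is a hypothetical name, and the sample names are made-up short strings used for illustration, not real data.

```python
import re
from collections import defaultdict
from itertools import combinations


def duplicate_pairs(names):
    """Return every pair of names that match once trailing digits are removed."""
    groups = defaultdict(list)
    for name in names:
        # Strip the trailing run of digits, whatever its length.
        key = re.sub(r"[0-9]+$", "", name)
        groups[key].append(name)
    pairs = []
    for group in groups.values():
        # A group of one name yields no pairs; a group of three yields three.
        pairs.extend(combinations(group, 2))
    return pairs


# Made-up names with timestamps of different lengths (illustration only).
sample = ['app-a-rtl2022', 'app-b-rtl20221', 'app-a-rtl2023000', 'app-a-rtl1']
for first, second in duplicate_pairs(sample):
    print(f'Duplicates : {first},{second}')
```

One caveat with this key: if the stable part of a name itself ends in a digit, the regex removes that digit as well, so the fixed-length slice is the safer choice when all timestamps are known to have the same length.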