Listing duplicate strings of a list in another list

Time:04-21

I have the following issue: I have been asked to write a Python script to list every pair of duplicate names.

The problem is that only part of each string matches; the last part is a number (the deployment time), for example:
asg-lc-crl-tst-turfpari-rtl20220124153420214800000001
asg-lc-crl-tst-turfpari-rtl20220330150836189100000001

Let's say I have a list with these 9 values:

(0) -- asg-lc-crl-tst-turfpari-rtl20220124153420214800000001  <--- duplicate with (1)
(1) -- asg-lc-crl-tst-turfpari-rtl20220330150836189100000001  <--- duplicate with (0)
(2) -- asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001  <--- duplicate with (4)
(3) -- asg-lc-crl-di1-ledger-manager-rtl20220414144111344500000001
(4) -- asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001  <--- duplicate with (2)
(5) -- asg-lc-crl-tst-art-manager-rtl20220124162240173500000001  <--- duplicate with (6)
(6) -- asg-lc-crl-tst-art-manager-rtl20220330150933020900000001  <--- duplicate with (5)
(7) -- asg-lc-bck-ope-backoh-oh20201021134525920100000001
(8) -- asg-lc-bck-ope-springbootadmin-oh20201021134526042200000002

I have written this code, but it is not working properly:

def list_duplicate_asg(asg1, asg2):
    # Compare the part before the last hyphen first
    if asg1.rpartition('-')[0] == asg2.rpartition('-')[0]:
        suffix1 = asg1.rpartition('-')[2]
        suffix2 = asg2.rpartition('-')[2]

        # Then compare the first three characters after the last hyphen
        if suffix1[0:3] == suffix2[0:3]:
            print('\n ========== Duplicate exists =========: \n')
            print('   ', asg1, '   ', asg2, '\n ============================ \n')

As you can see, duplicates are only printed when the two values follow each other in the list:

  • 0 & 1 : they get printed
  • 5 & 6 : they get printed
  • But, for example, (2) & (4) do not get printed ...

I don't know whether my parsing method is efficient or whether there is a much better one.
And how can I improve my code so that it detects duplicates even when they are not next to each other?

I want the result to look like this:

Duplicates : asg-lc-crl-tst-turfpari-rtl20220124153420214800000001,asg-lc-crl-tst-turfpari-rtl20220330150836189100000001
Duplicates : asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001,asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001
Duplicates : asg-lc-crl-tst-art-manager-rtl20220124162240173500000001,asg-lc-crl-tst-art-manager-rtl20220330150933020900000001

CodePudding user response:

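The data shown appears to be made up of at least 3 whitespace-delimited tokens per line, and the 3rd token is the one of interest. In each name, everything after the last hyphen (a short suffix such as rtl followed by the timestamp) can be dropped, which leaves the part before the last hyphen as the key to compare. Therefore: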

asgs = ['asg-lc-crl-tst-turfpari-rtl20220124153420214800000001',
        'asg-lc-crl-tst-turfpari-rtl20220330150836189100000001',
        'asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001',
        'asg-lc-crl-di1-ledger-manager-rtl20220414144111344500000001',
        'asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001',
        'asg-lc-crl-tst-art-manager-rtl20220124162240173500000001',
        'asg-lc-crl-tst-art-manager-rtl20220330150933020900000001',
        'asg-lc-bck-ope-backoh-oh20201021134525920100000001',
        'asg-lc-bck-ope-springbootadmin-oh20201021134526042200000002']

counter = dict()

for asg in asgs:
    k = asg[:asg.rfind('-')]
    counter[k] = counter.setdefault(k, 0) + 1
    if counter[k] == 2:
        print(k)

Obviously this will need to be adapted according to how the data are actually made available to your program. The output from this will be:

asg-lc-crl-tst-turfpari
asg-lc-dpr-dev1-app_hode
asg-lc-crl-tst-art-manager
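
The loop above prints only the shared prefix, whereas the question asks for the full names side by side. One possible way to get there is to collect the full names under each prefix and keep the groups with more than one member. This is only a sketch: `group_by_prefix` and `sample` are names made up for the illustration, the sample is a subset of the list from the question, and the `Duplicates :` label only approximates the requested format.

```python
from collections import defaultdict


def group_by_prefix(names):
    """Group names by the text before the last hyphen (hypothetical helper)."""
    groups = defaultdict(list)
    for name in names:
        # Same key as in the loop above: everything before the last '-'.
        key = name[:name.rfind('-')]
        groups[key].append(name)
    # Keep only the prefixes that occur more than once.
    return [group for group in groups.values() if len(group) > 1]


# Subset of the names from the question, with the duplicates not adjacent.
sample = [
    'asg-lc-crl-tst-turfpari-rtl20220124153420214800000001',
    'asg-lc-crl-tst-turfpari-rtl20220330150836189100000001',
    'asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001',
    'asg-lc-crl-di1-ledger-manager-rtl20220414144111344500000001',
    'asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001',
]

for group in group_by_prefix(sample):
    print('Duplicates : ' + ','.join(group))
```

Because every name is filed under its key in a single pass, the position of the duplicates in the list no longer matters.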

CodePudding user response:

This should work. The strings first get stripped of their timestamp suffix and then registered in a dict (we call that cleaned string "key" for now). The dict keeps track of all the keys that have been found so far. When a key is already known, a duplicate dictionary is filled. The duplicates dict has a list of all duplicates for each key.

import re

asgs = ['asg-lc-crl-tst-turfpari-rtl20220124153420214800000001',
     'asg-lc-crl-tst-turfpari-rtl20220330150836189100000001',
     'asg-lc-dpr-dev1-app_hode-hdh20220420140650975800000001',
     'asg-lc-crl-di1-ledger-manager-rtl20220414144111344500000001',
     'asg-lc-dpr-dev1-app_hode-hdh20220420143831109200000001',
     'asg-lc-crl-tst-art-manager-rtl20220124162240173500000001',
     'asg-lc-crl-tst-art-manager-rtl20220330150933020900000001',
     'asg-lc-bck-ope-backoh-oh20201021134525920100000001',
     'asg-lc-bck-ope-springbootadmin-oh20201021134526042200000002']


def get_duplicate_asgs(asgs: list):
    asgs_found = {}
    duplicates = {}
    for asg in asgs:
        asg_cleaned = asg[0:-26]
        # alternative solution for time stamps of different length:
        # asg_cleaned = re.sub("[0-9]+$", "", asg)
        if asg_cleaned in asgs_found:
            if asg_cleaned in duplicates:
                duplicates[asg_cleaned].append(asg)
            else:
                duplicates[asg_cleaned] = [asgs_found[asg_cleaned], asg, ]
        else:
            asgs_found[asg_cleaned] = asg
    return duplicates.values()

print(get_duplicate_asgs(asgs))
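
If the timestamps do not all have the same length, the commented-out regular-expression alternative above can be used instead of the fixed slice. The following is a rough sketch of that idea, assuming the timestamp is always the run of digits at the very end of the name. It also uses `itertools.combinations` to list every pair when a name occurs more than twice. `strip_timestamp`, `duplicate_pairs` and the shortened names in `names` are invented for this illustration and are not part of the original data.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumption: the timestamp is the run of digits at the very end of the name.
TRAILING_DIGITS = re.compile(r'[0-9]+$')


def strip_timestamp(name):
    """Return the name without its trailing digits (hypothetical helper)."""
    return TRAILING_DIGITS.sub('', name)


def duplicate_pairs(names):
    """Return every pair of names that share the same stripped key."""
    groups = defaultdict(list)
    for name in names:
        groups[strip_timestamp(name)].append(name)
    pairs = []
    for group in groups.values():
        # combinations() yields nothing for a group of one, one pair for
        # a group of two, three pairs for a group of three, and so on.
        pairs.extend(combinations(group, 2))
    return pairs


# Made-up, shortened names with timestamps of different lengths.
names = ['app-rtl2022012415', 'db-oh20201021',
         'app-rtl20220330150836', 'app-rtl2023']

for first, second in duplicate_pairs(names):
    print(f'Duplicates : {first},{second}')
```

Only the trailing digits are removed, so digits inside the name (as in dev1) stay part of the key.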