How can I identify two substrings from a list that make up one unique substring from a larger list i

I have a large list of unique strings (~1000), for instance: [bbbhbbbh, jjjhhssa, eeeffus,...]

And a smaller list of sub-string pairs (~50) that make up each of these unique strings: [bbbh, jjjh, hssa, eeef, fus,...]

I want to create a function that takes the large unique string list (~1000) as an argument and returns a dictionary with the unique string and the corresponding values of its two unique sub-strings.

For example:

result = {'bbbhbbbh': 'bbbh/bbbh', 
            'jjjhhssa': 'jjjh/hssa', 
            'eeeffus': 'eeef/fus',...}

I've tried a for loop, but it does not report a substring twice when a unique string is made of the same substring repeated (for example, 'bbbhbbbh'). Is there a more concise way, perhaps with a list comprehension, that also returns the two substrings that make up each unique string? I'd like to solve this using only the json package, without importing anything else. Thank you for any help with this.

My current loop and output:

result = []    

for string in pair_list:
    matches = []
    for substring in sub_list:
        if substring in string:
            matches.append(substring)
    if matches:
        result.append(matches)

print(result)

[['bbbh'], ['jjjh', 'hssa'], ['eeef', 'fus'],...

CodePudding user response：

Your original code does not look for repeated occurrences of a substring: once a substring is found in the main string, the loop moves on to the next one. The version below finds every occurrence by iterating over the matches returned by the regex function re.finditer.

import re

pair_list = ['bbbhbbbh', 'jjjhhssa', 'eeeffus', 'aaaabbbh', 'ccccdddd', 'eeefff']

sub_list = ['bbbh', 'jjjh', 'hssa', 'eeef', 'fus']

result = []
for string in pair_list:
    matches = []
    for substring in sub_list:
        # re.escape treats the substring as literal text, not as a pattern.
        # finditer yields one match per non-overlapping occurrence.
        for duplicate in re.finditer(re.escape(substring), string):
            matches.append(substring)
    if matches:
        result.append(matches)

print(result)

I hope this might help you.
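
If you need the dictionary shape from the question rather than a list of lists, the same finditer idea can be wrapped in a function. This is only a sketch: the function name map_pairs_with_regex is made up for illustration, and sorting the matches by their start position is my assumption about the order you want around the "/".

```python
import re


def map_pairs_with_regex(pair_list, sub_list):
    """Map each long string to its matched substrings, joined with '/'."""
    result = {}
    for string in pair_list:
        found = []  # (start position, substring) for every occurrence
        for substring in sub_list:
            # re.escape makes the substring match literally.
            for match in re.finditer(re.escape(substring), string):
                found.append((match.start(), substring))
        if found:
            found.sort()  # left-to-right order within the long string
            result[string] = "/".join(sub for _, sub in found)
    return result


pair_list = ["bbbhbbbh", "jjjhhssa", "eeeffus"]
sub_list = ["bbbh", "jjjh", "hssa", "eeef", "fus"]
print(map_pairs_with_regex(pair_list, sub_list))
# {'bbbhbbbh': 'bbbh/bbbh', 'jjjhhssa': 'jjjh/hssa', 'eeeffus': 'eeef/fus'}
```

Note that strings with only one matching substring (such as 'aaaabbbh') would still get an entry here, with a single value and no "/".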

CodePudding user response：

Based on your expected output, I think you want a dictionary where each long string is the key and all of its matched substrings form the value. I modified your code slightly: I added a dict to store the result and append each matched substring to the value.

Code:

pair_list = ["bbbhbbbh", "jjjhhssa", "eeeffus"]
sub_list = ["bbbh", "jjjh", "hssa", "eeef", "fus"]
pair_mapping_result = dict()

for pair_string in pair_list:
    for sub_string in sub_list:
        if sub_string in pair_string:
            # Append with "/" if the key already has a value; otherwise start it.
            if pair_mapping_result.get(pair_string):
                pair_mapping_result[pair_string] += "/" + sub_string
            else:
                pair_mapping_result[pair_string] = sub_string

print(pair_mapping_result)

Output

{'bbbhbbbh': 'bbbh', 'jjjhhssa': 'jjjh/hssa', 'eeeffus': 'eeef/fus'}
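
The output above shows 'bbbh' only once for 'bbbhbbbh', because the `in` test is true or false no matter how often the substring occurs. One way to get 'bbbh/bbbh', assuming every long string is exactly two substrings from sub_list placed back to back, is to try every split point and check both halves against a set. The function name split_into_pairs is made up for this sketch, and it needs no imports:

```python
def split_into_pairs(pair_list, sub_list):
    """Return {long_string: 'left/right'} where both halves are in sub_list."""
    subs = set(sub_list)  # set membership checks are fast
    result = {}
    for string in pair_list:
        # Try every position where the string could be cut in two.
        for i in range(1, len(string)):
            left, right = string[:i], string[i:]
            if left in subs and right in subs:
                result[string] = f"{left}/{right}"
                break  # keep the first valid split
    return result


pair_list = ["bbbhbbbh", "jjjhhssa", "eeeffus", "aaaabbbh"]
sub_list = ["bbbh", "jjjh", "hssa", "eeef", "fus"]
print(split_into_pairs(pair_list, sub_list))
# {'bbbhbbbh': 'bbbh/bbbh', 'jjjhhssa': 'jjjh/hssa', 'eeeffus': 'eeef/fus'}
```

Strings that cannot be split into two known substrings (here 'aaaabbbh') are simply left out of the result.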

We can also do this with a dict comprehension:

Code

pair_mapping_result = {
    pair_string: "/".join(sub_string for sub_string in sub_list if sub_string in pair_string)
    for pair_string in pair_list
}
print(pair_mapping_result)
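
Like the loop, this comprehension lists 'bbbh' only once for 'bbbhbbbh'. If you want to stay with a comprehension, a possible variant repeats each substring as many times as str.count finds it. Treat it as a sketch with two caveats: the function name map_pairs_by_count is made up, and the substrings come out in sub_list order rather than in the order they appear in the long string.

```python
def map_pairs_by_count(pair_list, sub_list):
    """Dict comprehension that repeats a substring once per occurrence."""
    return {
        pair: "/".join(
            sub
            for sub in sub_list
            # str.count gives the number of non-overlapping occurrences.
            for _ in range(pair.count(sub))
        )
        for pair in pair_list
    }


pair_list = ["bbbhbbbh", "jjjhhssa", "eeeffus"]
sub_list = ["bbbh", "jjjh", "hssa", "eeef", "fus"]
print(map_pairs_by_count(pair_list, sub_list))
# {'bbbhbbbh': 'bbbh/bbbh', 'jjjhhssa': 'jjjh/hssa', 'eeeffus': 'eeef/fus'}
```

A long string with no matching substring maps to an empty string here, so you may want to filter those out afterwards.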