Drop Duplicate Substrings from String with NO Spaces-CodePudding

Given a Pandas DF column that looks like this:

...how can I turn it into this:

XOM
ZM
AAPL
SOFI
NKLA
TIGR

Although these strings appear to be 4 characters in length maximum, I can't rely on that, I want to be able to have a string like ABCDEFGHIJABCDEFGHIJ and still be able to turn it into ABCDEFGHIJ in one column calculation. Preferably WITHOUT for looping/iterating through the rows.

CodePudding user response：

You can use regex pattern like r'\b(\w )\1\b' with str.extract like below:

df = pd.DataFrame({'Symbol':['ZOMZOM', 'ZMZM', 'SOFISOFI', 
                             'ABCDEFGHIJABCDEFGHIJ', 'NOTDUPLICATED']})
print(df['Symbol'].str.extract(r'\b(\w )\1\b'))

Output:

           0
0         ZOM
1          ZM
2        SOFI
3  ABCDEFGHIJ
4         NaN # <- from `NOTDUPLICATED`

Explanation:

\b is a word boundary
(w ) capture a word
\1 references to captured (w ) of the first group

CodePudding user response：

An alternative approach which does involve iteration, but also regular expressions. Evaluate longest possible substrings first, getting progressively shorter. Use the substring to compile a regex that looks for the substring repeated two or more times. If it finds that, replace it with a single occurrence of the substring.

Does not handle leading or trailing characters. that are not part of the repetition.

When it performs a removal, it returns, breaking the loop. Going with longest substrings first ensures things like 'AAPLAAPL' leave the double A intact.

import re

def remove_repeated(str):
    for i in range(len(str)):
        substr = str[i:]
        pattern = re.compile(f"({substr}){{2,}}")
        if pattern.search(str):
            return pattern.sub(substr, str)
    return str

>>> remove_repeated('abcdabcd')
'abcd'
>>> remove_repeated('abcdabcdabcd')
'abcd'
>>> remove_repeated('aabcdaabcdaabcd')
'aabcd'

If we want to make this more flexible, a helper function to get all of the substrings in a string, starting with the longest, but as a generator expression so we don't have to actually generate more than we need.

def substrings(str):
    return (str[i:i l] for l in range(len(str), 0, -1) 
                       for i in range(len(str) - l   1))

>>> list(substrings("hello"))
['hello', 'hell', 'ello', 'hel', 'ell', 'llo', 'he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']

But there's no way 'hello' is going to be repeated in 'hello', so we can make this at least somewhat more efficient by looking at only substrings at most half the length of the input string.

def substrings(str):
    return (str[i:i l] for l in range(len(str)//2, 0, -1) 
                       for i in range(len(str) - l   1))

>>> list(substrings("hello"))
['he', 'el', 'll', 'lo', 'h', 'e', 'l', 'l', 'o']

Now, a little tweak to the original function:

def remove_repeated(str):
    for s in substrings(str):
        pattern = re.compile(f"({s}){{2,}}")
        if pattern.search(str):
            return pattern.sub(s, str)
    return str

And now:

>>> remove_repeated('AAPLAAPL')
'AAPL'
>>> remove_repeated('fooAAPLAAPLbar')
'fooAAPLbar'