How to extract letters from a string (object) column in a pandas DataFrame

I am trying to extract letters from a string column in pandas. This is used for identifying the type of data I am looking at.

I want to take a column of:

GE5000341

R22256544443

PDalirnamm

AAddda

and create a new column of:

GE

R

PDALIRN

AD

CodePudding user response：

Until you explain the rule more precisely, this is the closest thing to what you want:

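import pandas as pd
df = pd.DataFrame({'code': ['GE5000341', 'R22256544443', 'PDalirnamm', 'AAddda']})
# leading letters, upper-cased, each letter kept once in first-seen order
df['letters'] = df['code'].str.extract(r'([a-zA-Z]*)', expand=False)
df['letters'] = df['letters'].str.upper().apply(lambda x: ''.join(sorted(set(x), key=x.index)))
print(df)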

Output:

           code   letters
0     GE5000341        GE
1  R22256544443         R
2    PDalirnamm  PDALIRNM
3        AAddda        AD
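One caveat worth checking: the pattern above only captures the run of letters at the very start of the string, and because `*` allows zero repetitions it matches an empty string when the value starts with a digit. The sketch below contrasts that with stripping every non-letter character. The last two sample values are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample: the last two values are invented to show the difference.
df = pd.DataFrame({'code': ['GE5000341', 'R22256544443', '12AB', 'AB12cd']})

# Leading run of letters only; the match is empty if the string starts with a non-letter.
leading = df['code'].str.extract(r'([a-zA-Z]*)', expand=False)

# Every letter in the string, wherever it appears.
all_letters = df['code'].str.replace(r'[^a-zA-Z]', '', regex=True)

print(pd.DataFrame({'code': df['code'], 'leading': leading, 'all_letters': all_letters}))
```

Which of the two is right depends on whether letters after the first digit should count, which the question does not say.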

CodePudding user response：

Assuming you want all letters up to the first duplicated letter (or the first non-letter character):

import pandas as pd

# example DataFrame
df = pd.DataFrame({'col': ['GE5000341', 'R22256544443', 'PDalirnamm', 'AAddda']})

# keep only until first duplicated letter
def untildup(s):
    out = []
    seen = set()
    for x in s.upper():
        # stop at the first repeated letter or the first non-letter
        if x in seen or not x.isalpha():
            break
        out.append(x)
        seen.add(x)
    # also reached when the loop ends without stopping early (e.g. 'ABC')
    return ''.join(out)

df['out'] = [untildup(s) for s in df['col']]

print(df)

Output:

            col      out
0     GE5000341       GE
1  R22256544443        R
2    PDalirnamm  PDALIRN
3        AAddda        A

If you instead want to keep each letter once, in order of first appearance:

df['out'] = [''.join(dict.fromkeys(x.upper() for x in s if x.isalpha()))
             for s in df['col']]

Output:

            col       out
0     GE5000341        GE
1  R22256544443         R
2    PDalirnamm  PDALIRNM
3        AAddda        AD
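Neither rule seems to reproduce the whole column given in the question. A rough way to see where each one diverges is to put both next to the expected values. The helper names `until_dup` and `unique_letters` are illustrative only, and the expected list is copied from the question.

```python
import pandas as pd

codes = ['GE5000341', 'R22256544443', 'PDalirnamm', 'AAddda']
expected = ['GE', 'R', 'PDALIRN', 'AD']  # values taken from the question

def until_dup(s):
    """Upper-cased letters up to the first repeated letter or non-letter."""
    out = []
    for ch in s.upper():
        if not ch.isalpha() or ch in out:
            break
        out.append(ch)
    return ''.join(out)

def unique_letters(s):
    """Upper-cased letters of s, each kept once, in first-seen order."""
    return ''.join(dict.fromkeys(ch.upper() for ch in s if ch.isalpha()))

check = pd.DataFrame({'col': codes, 'expected': expected})
check['until_dup'] = check['col'].map(until_dup)
check['unique'] = check['col'].map(unique_letters)
# Row-by-row comparison against the expected column
check['until_dup_ok'] = check['until_dup'] == check['expected']
check['unique_ok'] = check['unique'] == check['expected']
print(check)
```

The first rule misses only the last row (A instead of AD) and the second misses only the third row (PDALIRNM instead of PDALIRN), so the intended rule still needs clarifying by the asker.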