I am trying to extract letters from a string column in pandas. This is used for identifying the type of data I am looking at.
I want to take a column of:
GE5000341
R22256544443
PDalirnamm
AAddda
and create a new column of:
GE
R
PDALIRN
AD
CodePudding user response:
Until you explain it in more detail, here is the closest thing to what you want:
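import pandas as pd
df = pd.DataFrame({'code': ['GE5000341', 'R22256544443', 'PDalirnamm', 'AAddda']})
# take the leading run of letters, then uppercase it and drop repeated letters, keeping first-seen order
df['letters'] = df['code'].str.extract(r'([a-zA-Z]*)', expand=False)
df['letters'] = df['letters'].str.upper().apply(lambda x: ''.join(sorted(set(x), key=x.index)))
print(df)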
Output:
           code   letters
0     GE5000341        GE
1  R22256544443         R
2    PDalirnamm  PDALIRNM
3        AAddda        AD
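To make the two steps in the snippet above easier to follow, here is a small sketch that runs them separately. The value 'AB12CD' is made up for illustration (it is not in the question), and dedupe_keep_order is just a named version of the lambda. The pattern ([a-zA-Z]*) only captures the leading run of letters, so anything after the first non-letter is ignored.

```python
import pandas as pd

# 'AB12CD' is a made-up value, used only to show that the pattern stops at the first non-letter
s = pd.Series(['PDalirnamm', 'AB12CD'])
leading = s.str.extract(r'([a-zA-Z]*)', expand=False).str.upper()


def dedupe_keep_order(x):
    # set(x) removes repeats but loses order; sorting by x.index restores first-seen order
    return ''.join(sorted(set(x), key=x.index))


result = leading.apply(dedupe_keep_order).tolist()
print(result)
```

For short codes like these the repeated x.index lookups are cheap; ''.join(dict.fromkeys(x)) gives the same result in a single pass.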
CodePudding user response:
Assuming you want to get all letters until the first duplicated letter:
import pandas as pd

# example DataFrame
df = pd.DataFrame({'col': ['GE5000341', 'R22256544443', 'PDalirnamm', 'AAddda']})

# keep only until first duplicated letter
def untildup(s):
    out = []
    seen = set()
    for x in s.upper():
        # stop at the first non-letter or the first letter already seen
        if x in seen or not x.isalpha():
            break
        out.append(x)
        seen.add(x)
    # also reached when the string ends without any duplicate or non-letter
    return ''.join(out)

df['out'] = [untildup(s) for s in df['col']]
print(df)
Output:
            col      out
0     GE5000341       GE
1  R22256544443        R
2    PDalirnamm  PDALIRN
3        AAddda        A
If you want to keep the unique letters in order:
df['out'] = [''.join(dict.fromkeys(x.upper() for x in s if x.isalpha()))
             for s in df['col']]
Output:
            col       out
0     GE5000341        GE
1  R22256544443         R
2    PDalirnamm  PDALIRNM
3        AAddda        AD
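Neither output above matches the question's expected column on every row (it asks for PDALIRN and AD). One rule that happens to reproduce all four expected values is: uppercase, collapse runs of the same consecutive letter, then stop at the first non-letter or the first letter already seen. Whether this is what the asker meant is a guess; the function name below is made up for this sketch.

```python
from itertools import groupby

import pandas as pd


def letters_until_repeat(s):
    """Collapse runs of one letter, then stop at the first repeat or non-letter."""
    out = []
    seen = set()
    # groupby yields one key per run of identical consecutive characters
    for ch, _ in groupby(s.upper()):
        if not ch.isalpha() or ch in seen:
            break
        out.append(ch)
        seen.add(ch)
    return ''.join(out)


df = pd.DataFrame({'col': ['GE5000341', 'R22256544443', 'PDalirnamm', 'AAddda']})
df['out'] = df['col'].map(letters_until_repeat)
print(df)
```

On the four sample values this gives GE, R, PDALIRN and AD, the same as the column requested in the question.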