My DataFrame has two columns. Each value in the second column consists of the value from the first column followed by extra characters (letters and/or digits):
import pandas as pd

values = {
    'number': [2830, 8457, 9234],
    'nums': ['2830S', '8457M', '923442']
}
df = pd.DataFrame(values, columns=['number', 'nums'])
The extra characters always come after the common characters. How can I extract the characters that the two columns do not share? I am looking for a simple solution, not a loop that checks every character.
CodePudding user response:
Replace the common characters with an empty string:
f_diff = lambda x: x['nums'].replace(x['number'], '')
df['extra'] = df[['number', 'nums']].astype(str).apply(f_diff, axis=1)
print(df)
# Output
   number    nums  extra
0    2830   2830S      S
1    8457   8457M      M
2    9234  923442     42
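One caveat with the replace-based approach is that str.replace removes every occurrence of the substring, not only the leading one. The sketch below illustrates this on a hypothetical row (not from the question) in which the number appears twice; passing a count of 1 limits the replacement to the first occurrence.

```python
import pandas as pd

# Hypothetical row where the number repeats inside nums (illustration only).
df = pd.DataFrame({'number': [2830], 'nums': ['2830X2830']})
as_text = df[['number', 'nums']].astype(str)

# Plain replace removes every occurrence of the number.
by_replace = as_text.apply(
    lambda x: x['nums'].replace(x['number'], ''), axis=1)

# A count of 1 removes only the first occurrence.
by_replace_once = as_text.apply(
    lambda x: x['nums'].replace(x['number'], '', 1), axis=1)

print(by_replace.tolist())       # ['X']
print(by_replace_once.tolist())  # ['X2830']
```

If the number can never reappear in the extra characters, both variants give the same result.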
Update
If the number values are always the first characters of the nums column, you can use a simpler function:
f_diff2 = lambda x: x['nums'][len(x['number']):]
df['extra'] = df[['number', 'nums']].astype(str).apply(f_diff2, axis=1)
print(df)
# Output
   number    nums  extra
0    2830   2830S      S
1    8457   8457M      M
2    9234  923442     42
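As a possible alternative to apply with axis=1, the same slicing can be written as a list comprehension over the two columns. This is only a sketch under the same assumption that number is always a prefix of nums; the variable name prefixes is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'number': [2830, 8457, 9234],
    'nums': ['2830S', '8457M', '923442'],
})

# Convert the numbers to strings once so their lengths can be measured.
prefixes = df['number'].astype(str)

# Drop as many leading characters from nums as the number has digits.
df['extra'] = [s[len(p):] for s, p in zip(df['nums'], prefixes)]

print(df['extra'].tolist())  # ['S', 'M', '42']
```

It iterates over rows, not over individual characters, so it stays within what the question asks for.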
CodePudding user response:
I would remove the prefix from the string. To do this, you can use the apply() method to run the following function on each row:
def remove_prefix(text, prefix):
    if text.startswith(prefix):
        return text[len(prefix):]
    return text
df['nums'] = df.apply(lambda x: remove_prefix(x['nums'], str(x['number'])), axis=1)
df
Output:
   number  nums
0    2830     S
1    8457     M
2    9234    42
If you have Python 3.9 or later, you only need this (note that the number must be converted to a string first):
df['nums'] = df.apply(lambda x: x['nums'].removeprefix(str(x['number'])), axis=1)
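The prefix check matters when a row does not follow the expected pattern. The sketch below compares the checked removal with blind slicing by length; the second row is a hypothetical mismatch added purely for illustration.

```python
import pandas as pd

def remove_prefix(text, prefix):
    # Strip the prefix only when the text really starts with it.
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

# The second row is a hypothetical mismatch (nums does not start with number).
df = pd.DataFrame({'number': [2830, 1234], 'nums': ['2830S', '9999A']})

# Checked removal leaves the mismatched row untouched.
checked = [remove_prefix(s, str(n)) for s, n in zip(df['nums'], df['number'])]

# Blind slicing cuts the first len(number) characters regardless of content.
sliced = [s[len(str(n)):] for s, n in zip(df['nums'], df['number'])]

print(checked)  # ['S', '9999A']
print(sliced)   # ['S', 'A']
```

Which behaviour is preferable depends on whether mismatched rows should be kept intact for inspection or trimmed anyway.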