I have this as the ID numbers: "00456, 0000456, 567, 00567" in a dataframe called "test".
I created a dataframe where it has the IDs with leading zeros in a column called ID, and a left strip version of the ID in a column named stripped. and an indicator column coded like this:
df.loc[df['stripped'].isin(test['ID'], 'indicator']=1
This tells me if the left stripped version of the ID is in the test df. So for 00567, it would indicate 1, since 567 is in the test dataframe. And for 456 it would indicator nan, since 456 is not in test.
I am trying to find a way to indicate for leading zero differences. So I want an indicator column to indicate yes that 00456 and 0000456 are the same with leading zero differences,and let me access that info easily. I am a bit at a loss for how to do this, I tried groupby, but because they are not matches it wasnt working. Any tips would be appreciated.
ID stripped indicator
00456 456 1
0000456 456 1
Currently, the output is that the indicator column is nan for rows shown above. But I want it to be what is shown above, indicator = 1. Showing that theres leading zero differences.
I am not sure how to compare row by row, and by specific ID
Updated Question: How do I code a way of comparing the semi matching IDs, 00456, 0000456. and indicating a leading zero difference. I am thinking a .str.contains, but I am not sure how to group that to only the same ID numbers.
CodePudding user response:
Are either of these helpful to what you want to do?
Given a column of ids:
id
0 00456
1 0000456
2 456
3 00345
4 345
5 00000345
Doing:
df['id'].groupby(df.id.str.lstrip('0')).agg(list)
Output:
id
345 [00345, 345, 00000345]
456 [00456, 0000456, 456]
Name: id, dtype: object
Or doing:
df['all'] = df['id'].groupby(df.id.str.lstrip('0')).transform(lambda x: [x.tolist()]*len(x))
# or
df['all'] = df['id'].groupby(df.id.str.lstrip('0')).transform(' '.join).str.split()
Output:
id all
0 00456 [00456, 0000456, 456]
1 0000456 [00456, 0000456, 456]
2 456 [00456, 0000456, 456]
3 00345 [00345, 345, 00000345]
4 345 [00345, 345, 00000345]
5 00000345 [00345, 345, 00000345]
CodePudding user response:
IIUC you need an or condition to look both in ID, stripped.
df.loc[((df['stripped'].isin(test['ID']) | (df['ID'].isin(test['ID'])), 'indicator']=1