How to indicate leading zero differences row wise?-CodePudding

I have this as the ID numbers: "00456, 0000456, 567, 00567" in a dataframe called "test".

I created a dataframe where it has the IDs with leading zeros in a column called ID, and a left strip version of the ID in a column named stripped. and an indicator column coded like this:

df.loc[df['stripped'].isin(test['ID'], 'indicator']=1

This tells me if the left stripped version of the ID is in the test df. So for 00567, it would indicate 1, since 567 is in the test dataframe. And for 456 it would indicator nan, since 456 is not in test.

I am trying to find a way to indicate for leading zero differences. So I want an indicator column to indicate yes that 00456 and 0000456 are the same with leading zero differences,and let me access that info easily. I am a bit at a loss for how to do this, I tried groupby, but because they are not matches it wasnt working. Any tips would be appreciated.

ID        stripped  indicator
00456       456        1
0000456     456        1

Currently, the output is that the indicator column is nan for rows shown above. But I want it to be what is shown above, indicator = 1. Showing that theres leading zero differences.

I am not sure how to compare row by row, and by specific ID

Updated Question: How do I code a way of comparing the semi matching IDs, 00456, 0000456. and indicating a leading zero difference. I am thinking a .str.contains, but I am not sure how to group that to only the same ID numbers.

CodePudding user response：

Are either of these helpful to what you want to do?

Given a column of ids:

         id
0     00456
1   0000456
2       456
3     00345
4       345
5  00000345

Doing:

df['id'].groupby(df.id.str.lstrip('0')).agg(list)

Output:

id
345    [00345, 345, 00000345]
456     [00456, 0000456, 456]
Name: id, dtype: object

Or doing:

df['all'] = df['id'].groupby(df.id.str.lstrip('0')).transform(lambda x: [x.tolist()]*len(x))
# or
df['all'] = df['id'].groupby(df.id.str.lstrip('0')).transform(' '.join).str.split()

Output:

         id                     all
0     00456   [00456, 0000456, 456]
1   0000456   [00456, 0000456, 456]
2       456   [00456, 0000456, 456]
3     00345  [00345, 345, 00000345]
4       345  [00345, 345, 00000345]
5  00000345  [00345, 345, 00000345]

CodePudding user response：

IIUC you need an or condition to look both in ID, stripped.

df.loc[((df['stripped'].isin(test['ID']) | (df['ID'].isin(test['ID'])), 'indicator']=1