How to create a dictionary to look for dropped zeros?

I ran into a problem where I have a dataframe of ID numbers, and some of these IDs have dropped their leading zeros. The dataframe is called df.

What I'm trying to do is create a generalized way to check whether leading zeros have been dropped. My real data set has millions of rows, so I want to use a pandas method: if a short ID matches the tail of a zero-padded ID, put those rows into another dataframe so I can examine them further.

I do that like this:

new_df = df.loc[df['ID'].isin(df['ID'])]

My reasoning for this is that I want to filter that dataset to find if any of the IDs are inside the full IDs.

Now I have

I can use .unique() to get an array of each unique combo.

This is fine for a small dataset, but with millions of rows I am wondering how to make this check easier.

I am trying to find a way to create a dictionary where the keys are the 3-digit IDs and the values are the full IDs, or vice versa. Any tips on that would be appreciated. Tips on a different way to check for dropped zeros, other than the dictionary approach, would be helpful too.

Note: It is not always 3 digits. Could be 4567 for example, where the real value would be 004567.

CodePudding user response：

One option is to strip leading "0"s:

out = df['ID'].str.lstrip('0').unique()

Output:

array(['345', '543', '922'], dtype=object)

or prepend "0"s:

out = df['ID'].str.zfill(df['ID'].str.len().max()).unique()

Output:

array(['000345', '000543', '000922'], dtype=object)
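
To get from either normalization to the rows worth examining, one possibility is to group on the stripped key and keep the groups that contain more than one distinct spelling. This is only a sketch, assuming the IDs are stored as strings; the sample values and the names key, n_forms and suspects are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data; the IDs must be strings so the zeros survive.
df = pd.DataFrame({'ID': ['345', '345', '540', '2922', '002922',
                          '000344', '000345', '000543']})

# Normalized key: the ID without its leading zeros.
key = df['ID'].str.lstrip('0')

# For every row, the number of distinct spellings that share its key.
n_forms = df['ID'].groupby(key).transform('nunique')

# Rows whose key occurs with more than one spelling (differing zero padding).
suspects = df[n_forms > 1]
print(suspects)
```

Because the grouping is vectorized, this should scale to millions of rows better than a Python-level loop, and suspects is already the separate dataframe the question asks for.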

CodePudding user response：

Use:

print (df)
       ID
0     345
1     345
2     540
3    2922
4  002922
5  000344
6  000345
7  000543

# keep only the IDs that start with '0', as a Series
d = df.loc[df['ID'].str.startswith('0'), 'ID']
# index that Series by the same IDs with the leading zeros stripped
d.index = d.str.lstrip('0')
print (d)
ID
2922    002922
344     000344
345     000345
543     000543
Name: ID, dtype: object

# dict of all candidate pairs: stripped ID -> padded ID
print (d.to_dict())
{'2922': '002922', '344': '000344', '345': '000345', '543': '000543'}

# keep only the pairs whose stripped form also occurs in the original ID column
d = d[d.index.isin(df['ID'])].to_dict()
print (d)
{'2922': '002922', '345': '000345', '543': '000543'}
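
A plain dict holds one value per key, so if the same stripped ID occurs with several different paddings (say 00345 and 000345), only one of them can survive. A possible variation, sketched here in plain Python with made-up sample IDs and the hypothetical names forms and dropped, collects every spelling per key:

```python
from collections import defaultdict

# Hypothetical sample IDs, kept as strings.
ids = ['345', '345', '540', '2922', '002922', '00345', '000345', '000543']

# Map each stripped key to the set of spellings seen for it.
forms = defaultdict(set)
for raw in ids:
    forms[raw.lstrip('0')].add(raw)

# Keep only the keys that were seen with more than one spelling.
dropped = {k: sorted(v) for k, v in forms.items() if len(v) > 1}
print(dropped)
```

Passing df['ID'] in place of the list works as well, since iterating a Series yields its values, although a Python loop over millions of rows will be slower than a vectorized approach.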

CodePudding user response：

You can convert the column to int and back to str, then compare the result with the original:

m = df['ID'].ne(df['ID'].astype(int).astype(str))

print(m)

0    False
1    False
2    False
3     True
4     True
5     True
Name: ID, dtype: bool

print(df[m])

       ID
3  000345
4  000345
5  000543
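
The integer conversion assumes every ID consists only of digits; astype(int) raises a ValueError on anything else. A sketch of a similar mask that stays with strings, assuming the IDs are stored as strings (the sample frame, including the non-numeric A0012, is made up):

```python
import pandas as pd

# Hypothetical sample data with one non-numeric ID.
df = pd.DataFrame({'ID': ['345', '345', '543', '000345',
                          '000345', '000543', 'A0012']})

# True where stripping the leading zeros changes the value.
m = df['ID'].ne(df['ID'].str.lstrip('0'))
print(df[m])
```

One difference to be aware of: an ID that is just '0' is flagged by this mask, because stripping leaves an empty string.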