How to check if an element in a column of a pandas data frame occurs twice or more in that column

I have the following pandas data frame. I am trying to write code that checks whether each element of the "to_address" column occurs two or more times in that column.

I am writing the code so that once a row has been matched, it is excluded from subsequent iterations and no match is counted twice. For that, I have added a "flag_column", which is initially populated with 1. Unless a match occurs, the value remains 1; when a match occurs, the value in that row of "flag_column" changes to 0. If a later iteration encounters a row with "flag_column" == 0, the calculation for that row is skipped.

n=0
m=0
for i in range(df_containing_nft.shape[0]): # or len(df_containing_nft)
    Text_to_search=df_containing_nft.iat[i,2]
    if(df_containing_nft.iat[i,3]==0):
        m=m+1
        continue
    for j in range(i+1, df_containing_nft.shape[0]):
        if (df_containing_nft.iat[j,2]==Text_to_search):
            n=n+1
            print(Text_to_search)
            print(df_containing_nft.iloc[i])
            print(df_containing_nft.iloc[j])
            df_containing_nft.at[j, 3] = 0 # Setting flag equal to zero. Row already visited           
            print("------------------------------------------------------------------")
            
print(n)
print(m)

The problem I am facing is that the number of matches is around 67,500 (I arrived at this by printing n), which is about 20 times the 3,177 rows in the data frame, and that seems odd. Also, the idea of setting the flag to 0 and skipping those rows doesn't seem to be working, because m turns out to be zero.

Am I making a mistake in my calculation, or is this indeed the correct way to arrive at the result? I am unable to tell. Please help me solve my problem.

CodePudding user response：

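The duplicated method returns a Boolean Series that is True for each value that has already appeared earlier in the column (the first occurrence is marked False). The tilde (~) inverts that mask, and converting to int turns it into 1 and 0, so first occurrences get 1 and repeats get 0.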

df['flag_column'] = (~df['to_address'].duplicated()).astype(int)
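
If the goal is to mark every row whose address occurs two or more times (rather than only the repeats after the first), the keep=False option of duplicated may be closer to what the question asks. The following is a minimal sketch on a made-up six-row frame; the sample addresses and the column names "occurs_twice_or_more" and "occurrences" are assumptions for illustration, not part of the original data.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data; only "to_address" matters here.
df = pd.DataFrame({"to_address": ["0xA", "0xB", "0xA", "0xC", "0xB", "0xA"]})

# 1 for the first occurrence of each address, 0 for every later repeat.
df["flag_column"] = (~df["to_address"].duplicated()).astype(int)

# True for every row whose address appears two or more times in the column.
df["occurs_twice_or_more"] = df["to_address"].duplicated(keep=False)

# Number of times each row's address appears in the whole column.
counts = df["to_address"].value_counts()
df["occurrences"] = df["to_address"].map(counts)

print(df)
```

Both calls are vectorized, so they avoid the nested Python loop over all pairs of rows.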

CodePudding user response：

This code drops duplicate rows from a pandas data frame based on the values in one column, keeping the first occurrence of each value:

import pandas as pd

# Replace this with your dataframe
df = pd.DataFrame([{"name": "apple",  "fruit": True,  "vegetable": False, "value": 1},
                   {"name": "banana", "fruit": True,  "vegetable": False, "value": 2},
                   {"name": "carrot", "fruit": False, "vegetable": True, "value": 1}])

print(df) # My dataframe
#      name  fruit  vegetable  value
# 0   apple   True      False      1
# 1  banana   True      False      2
# 2  carrot  False       True      1

df = df.drop_duplicates(subset="value") # Change column name to yours

# Print new dataframe
print(df)
#      name  fruit  vegetable  value
# 0   apple   True      False      1
# 1  banana   True      False      2
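
Returning to the loop in the question: one likely reason m stays at zero is that the flag is read with .iat (position-based) but written with .at (label-based). Assuming the columns are labelled by name, there is no column labelled 3, so the write adds a new column called 3 and never touches "flag_column". With the flag never cleared, the loop counts every pair of matching rows, which is k(k-1)/2 for an address that occurs k times, and that would be consistent with n being far larger than the row count. The sketch below uses a made-up four-row frame; the column names "tx" and "from_address" and all values are assumptions.

```python
import pandas as pd

# Hypothetical miniature of the asker's frame: "to_address" at position 2,
# "flag_column" at position 3, all columns labelled by name.
df = pd.DataFrame({
    "tx": [1, 2, 3, 4],
    "from_address": ["p", "q", "r", "s"],
    "to_address": ["0xA", "0xB", "0xA", "0xA"],
    "flag_column": [1, 1, 1, 1],
})

# Label-based write, as in the question: no column is labelled 3,
# so pandas adds a new column called 3 and leaves "flag_column" untouched.
by_label = df.copy()
by_label.at[2, 3] = 0

# Position-based write: column position 3 is "flag_column".
by_position = df.copy()
by_position.iat[2, 3] = 0

print(by_label.columns.tolist())
print(by_label["flag_column"].tolist())
print(by_position["flag_column"].tolist())

# Pairs counted by the nested loop when no row is ever skipped:
# k * (k - 1) / 2 for each address that occurs k times.
counts = df["to_address"].value_counts()
pairs = int((counts * (counts - 1) // 2).sum())
print(pairs)
```

Replacing .at[j, 3] with .iat[j, 3] in the original loop would make the write hit the same column that the read checks.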