I have the following pandas data frame. I am trying to write code that checks whether each element of the column "to_address" occurs two or more times in that column.
I want a row that has already been matched to be excluded from subsequent iterations, so that no duplicate is counted twice. To do this, I added a "flag_column" that is initially populated with 1. The value stays 1 unless a match occurs; when a match occurs, the value in that row of "flag_column" changes to 0. If a later iteration encounters a row with "flag_column" == 0, the calculation is skipped.
n=0
m=0
for i in range(df_containing_nft.shape[0]): # or len(df_containing_nft)
    Text_to_search=df_containing_nft.iat[i,2]
    if(df_containing_nft.iat[i,3]==0):
        m=m+1
        continue
    for j in range(i+1, df_containing_nft.shape[0]):
        if (df_containing_nft.iat[j,2]==Text_to_search):
            n=n+1
            print(Text_to_search)
            print(df_containing_nft.iloc[i])
            print(df_containing_nft.iloc[j])
            df_containing_nft.at[j, 3] = 0 # Setting flag equal to zero. Row already visited
            print("------------------------------------------------------------------")
print(n)
print(m)
The problem is that the number of matches is around 67,500 (I arrived at this by printing n), which is about 20 times the number of rows in the data frame (3,177). That seems odd. Also, setting the flag to 0 and skipping the calculation doesn't seem to work, because m turns out to be zero.
Am I making a mistake in my calculation, or is this indeed the correct way to arrive at the result? Please help me solve my problem.
CodePudding user response:
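The duplicated method returns a boolean Series that is True for every row whose value has already appeared earlier in the column. The tilde (~) inverts it, so first occurrences become True and repeats become False. Converting to int turns that into 1 and 0 values:
df['flag_column'] = (~df['to_address'].duplicated()).astype(int)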
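As a rough illustration of the one-liner above, here is a minimal sketch on made-up data. Only the column names "to_address" and "flag_column" come from the question; the addresses and the helper column "occurs_twice_or_more" are assumptions for the example. It also shows `keep=False`, which marks every occurrence of a repeated value, and `value_counts()`, which may be closer to the original goal of finding addresses that occur two or more times.

```python
import pandas as pd

# Hypothetical sample data; only the column name "to_address" comes from the question.
df = pd.DataFrame({"to_address": ["0xa", "0xb", "0xa", "0xc", "0xb", "0xa"]})

# 1 for the first occurrence of each address, 0 for every later repeat.
df["flag_column"] = (~df["to_address"].duplicated()).astype(int)

# keep=False marks every row whose address occurs at least twice,
# including the first occurrence.
df["occurs_twice_or_more"] = df["to_address"].duplicated(keep=False)

# Number of rows per address, restricted to addresses seen at least twice.
counts = df["to_address"].value_counts()
repeated = counts[counts >= 2]

print(df)
print(repeated)
```

Both calls scan the column without an explicit Python loop, so they should scale far better than a nested loop over a few thousand rows.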
CodePudding user response:
This code drops duplicate rows from a pandas data frame based on a column, keeping the first row for each value:
import pandas as pd
# Replace this with your dataframe
df = pd.DataFrame([{"name": "apple", "fruit": True, "vegetable": False, "value": 1},
                   {"name": "banana", "fruit": True, "vegetable": False, "value": 2},
                   {"name": "carrot", "fruit": False, "vegetable": True, "value": 1}])
print(df) # My dataframe
#      name  fruit  vegetable  value
# 0   apple   True      False      1
# 1  banana   True      False      2
# 2  carrot  False       True      1
df = df.drop_duplicates(subset="value") # Change column name to yours
# Print new dataframe
print(df)
#      name  fruit  vegetable  value
# 0   apple   True      False      1
# 1  banana   True      False      2
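Coming back to the loop in the question, one possible explanation for m staying at zero is the mix of accessors. `.iat` is position-based, while `.at` is label-based, so `df_containing_nft.at[j, 3] = 0` addresses a column labelled 3 rather than the fourth column by position. Assuming the columns carry string labels such as "flag_column", the flag read by `.iat[i, 3]` would never change. With no rows skipped, each address that occurs k times contributes k(k-1)/2 matches, which could account for n being much larger than the row count. The sketch below uses `.iat` for both the read and the write; the frame and the columns "token_id" and "from_address" are hypothetical, chosen only so that positions 2 and 3 line up with the question's indexing.

```python
import pandas as pd

# Hypothetical frame: four columns so that positions 2 and 3 match
# the positional indexing used in the question.
df = pd.DataFrame({
    "token_id": [1, 2, 3, 4, 5],
    "from_address": ["0x1", "0x2", "0x3", "0x4", "0x5"],
    "to_address": ["0xa", "0xb", "0xa", "0xa", "0xb"],
    "flag_column": [1, 1, 1, 1, 1],
})

n = 0  # matches found
m = 0  # rows skipped because they were already flagged
for i in range(len(df)):
    if df.iat[i, 3] == 0:
        m += 1
        continue
    text_to_search = df.iat[i, 2]
    for j in range(i + 1, len(df)):
        if df.iat[j, 2] == text_to_search:
            n += 1
            df.iat[j, 3] = 0  # positional write: position 3 is flag_column

print(n, m)
print(df)
```

On this sample the loop ends with the same flags that `(~df["to_address"].duplicated()).astype(int)` produces, so the vectorized form is a drop-in replacement whenever only the flag column is needed.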