drop_duplicates() altering data?

Time:05-05

I have a df that comes from a scrape and returns a lot of unnecessary data, but when I use drop_duplicates, some columns come back with altered values.

This is my original df:

    |Date      | Sex | Name |
    |01-02-2022| F   | A    |
    |09-02-2022| F   | A    |
    |10-02-2022| M   | B    |
    |27-02-2022| M   | B    |

when I use df.drop_duplicates('Name', keep='first') I receive back

        |Date      | Sex | Name |
        |01-02-2022| F   | A    |
        |01-02-2022| F   | B    |

Why?

Is there a better way to keep the first value of A and B, considering a huge df?

    import time
    import datetime
    import pandas as pd
    from bs4 import BeautifulSoup
    from selenium import webdriver

    PATH = r"C:\Users\Gustavo.vieira\Desktop\python\drivers\msedgedriver.exe"
    cols = ['data_evento', 'data_liquidacao', 'evento', 'tx%', 'valor_pago', 'status', 'Ativo']
    url_cr = 'https://data.anbima.com.br/certificado-de-recebiveis/{}/agenda'
    lista_teste = ['CRA0160000P', 'CRA0160000X', 'CRA017001P6']
    data = []
    blank_row = []
    final_df = pd.DataFrame()


    # Selenium is necessary because url_cr and url_deb use a different
    # g-google-authorization token for each asset
    for cr in lista_teste:
        driver = webdriver.Edge(PATH)
        driver.get(url_cr.format(cr))
        time.sleep(3)
        soup = BeautifulSoup(driver.page_source)
        # With the page open, find the first table and all rows of that table
        try:
            html_data = soup.find_all('table')[0].find_all('tr')[1:]
            for element in html_data:
                # Local variable to collect the cells of one row
                sub_data = []
                for sub_element in element:
                    try:
                        sub_data.append(sub_element.get_text())
                    except:
                        continue
                    data.append(sub_data)
                cr_df = pd.DataFrame(data)
                cr_df[6] = cr
            final_df = final_df.append(cr_df)
        except:
            print('dados indisponiveis do ativos: {}.'.format(cr))

    final_df.columns = cols
    today = datetime.date.today()
    final_df['data_liquidacao'] = pd.to_datetime(final_df['data_liquidacao'], infer_datetime_format=True)
    final_df = final_df[final_df['data_liquidacao'].dt.date > today]
    teste = final_df.drop_duplicates(['Ativo'])

CodePudding user response:

Without seeing the actual data, we can only guess at what's going on. My guess is that your data doesn't actually look like you said it does. Here is your code with fake data showing that drop_duplicates actually does work as advertised:

import pandas as pd

data = [
        ['02-01-2022','F','A'],
        ['02-09-2022','F','A'],
        ['02-10-2022','M','B'],
        ['02-27-2022','M','B']
]

df = pd.DataFrame( data, columns=['Date','Sex','Name'])
print(df)
df = df.drop_duplicates('Name',keep='first')
print(df)

Output:

         Date Sex Name
0  02-01-2022   F    A
1  02-09-2022   F    A
2  02-10-2022   M    B
3  02-27-2022   M    B
         Date Sex Name
0  02-01-2022   F    A
2  02-10-2022   M    B
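As for the follow-up about a huge df: drop_duplicates is already vectorized and should scale fine, but as a sketch of an alternative (using the same fake data as above), groupby(...).first() also keeps the first row per Name:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['02-01-2022', '02-09-2022', '02-10-2022', '02-27-2022'],
    'Sex':  ['F', 'F', 'M', 'M'],
    'Name': ['A', 'A', 'B', 'B'],
})

# one row per Name, taking the first occurrence of every other column
first_rows = df.groupby('Name', as_index=False).first()
print(first_rows)
```

Note that groupby sorts the group keys and takes the first non-null value per column, so the result can differ from drop_duplicates when there are NaNs; for plain "keep the first row", drop_duplicates is usually the simpler choice.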

CodePudding user response:

Try passing the column whose duplicated values you want to drop:

df_foo = df_foo.drop_duplicates(['your_column'])
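(The column names here are just placeholders.) The subset can also be a list of several columns, in which case rows count as duplicates only when they match on all listed columns, for example:

```python
import pandas as pd

df_foo = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y'], 'c': [10, 20, 30]})

# rows 0 and 1 match on both 'a' and 'b', so row 1 is dropped
df_foo = df_foo.drop_duplicates(['a', 'b'])
print(df_foo)
```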

CodePudding user response:

Here is how the parameters of drop_duplicates() work:

Parameters:

subset: By default, rows are considered duplicates if they have the same values in all columns. This parameter restricts the comparison to only the specified columns.

keep: Determines which duplicates (if any) to keep. 'first' drops duplicates except for the first occurrence (the default); 'last' drops duplicates except for the last occurrence; False drops all duplicates.

inplace: Specifies whether to return a new DataFrame or update the existing one. It is a boolean flag with default False.

ignore_index: A boolean flag indicating whether the row index should be reset after dropping duplicate rows. False (the default) keeps the original row index; True resets it, so the resulting rows are labeled 0, 1, …, n - 1.
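A quick sketch (with made-up data) showing how the keep options change the result:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'B'], 'Val': [1, 2, 3]})

first = df.drop_duplicates('Name', keep='first')  # keeps rows 0 and 2
last = df.drop_duplicates('Name', keep='last')    # keeps rows 1 and 2
none = df.drop_duplicates('Name', keep=False)     # keeps only row 2 ('B' is unique)
```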

I tried running your code and got the following output.

Date    Sex Name
0   01-02-2022  F   A
2   10-02-2022  M   B

The reason you are getting your result might be that you sorted the columns 'Date' and 'Sex' before removing duplicates, and there could be a value '01-02-2022' and 'F' for Name 'B'. In that case, don't sort the values before removing duplicates.

And if you want the index reset, try:

df = df.drop_duplicates('Name', keep = 'first', ignore_index=True)

Below is the code I tried:

import pandas as pd

dict_ = {"Date": ["01-02-2022", "09-02-2022", "10-02-2022", "27-02-2022"],
         "Sex": ['F', 'F', 'M', 'M'],
         "Name": ['A', 'A', 'B', 'B']}

df = pd.DataFrame(dict_)
print(df)

# drop duplicate rows
df = df.drop_duplicates('Name', keep = 'first', ignore_index=True)

Hope this helps!
