Python: Deleting duplicates from Dataframe

Time:04-07

When I print dfges, I get duplicates for a given city. For example, for Madrid every news item is displayed three times. Does anyone know a way to work around this problem or remove the duplicates? Could it be caused by the outer for loop?

import ssl
import smtplib
import pandas as pd
from GoogleNews import GoogleNews

# Disable HTTPS certificate verification (insecure workaround for SSL errors)
ssl._create_default_https_context = ssl._create_unverified_context

googlenews = GoogleNews()
googlenews.set_encode('utf_8')
googlenews.set_lang('en')
googlenews.set_period('7d')

orte = ["Munich", "New York", "Madrid", "London", "Los Angeles", "Frankfurt", "Rom"]
nachrichten = []

for ort in orte:
    googlenews.clear()  # reset the results of the previous city
    googlenews.get_news(ort)
    table_new = []

    for row in googlenews.results():
        table_new.append({
            'City': ort,
            'Title': row['title'],
            'Date': row['date'],
            'URL': row['link'],
            'Source': row['site'],
        })

    # Build the DataFrame once per city, after all rows are collected
    df = pd.DataFrame(table_new)
    nachrichten.append(df)

dfges = pd.concat(nachrichten, axis='index')
print(dfges)

CodePudding user response:

It seems you need dfges.drop_duplicates(inplace=True), which removes all duplicate rows. Note that two rows are considered duplicates only if they have the same values in every column (the index is ignored).

See the documentation for further details: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
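To illustrate the point about full-row comparison, here is a minimal sketch. The sample frame is made up and only stands in for dfges; the titles and URLs are hypothetical.

```python
import pandas as pd

# Hypothetical sample data standing in for dfges (titles and URLs are made up).
dfges = pd.DataFrame(
    {
        "City": ["Madrid", "Madrid", "Madrid", "London"],
        "Title": ["Story A", "Story A", "Story B", "Story A"],
        "URL": [
            "https://example.com/a",
            "https://example.com/a",
            "https://example.com/b",
            "https://example.com/a",
        ],
    }
)

# Rows 0 and 1 match in every column, so only the first of them is kept.
# Row 3 has the same Title and URL but a different City, so it is not a duplicate.
deduped = dfges.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Here the non-inplace form is used so the original frame stays available for comparison; inplace=True would modify dfges directly instead.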

Please provide an example of your DataFrame dfges if you need more help.

CodePudding user response:

Articles can be tagged with multiple locations, e.g. "Munich" and "Madrid", and therefore appear multiple times in your DataFrame. You should detect and delete these duplicates using drop_duplicates(subset=...) based on 'Title' (and further criteria if needed). Otherwise no duplicates are deleted at all, because rows for the same article under different cities differ in the 'City' column.

Your code could look like this:

...
dfges = pd.concat(nachrichten, axis='index')
dfges.drop_duplicates(subset=['Title'], keep='last', inplace=True)
print(dfges)
...
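The difference between the subset choices can be sketched on a small, made-up frame. The data below is hypothetical and only mimics the shape of dfges; which subset is right depends on whether you want each article once overall or once per city.

```python
import pandas as pd

# Hypothetical sample data: "Story A" is listed for two cities and twice for Madrid.
dfges = pd.DataFrame(
    {
        "City": ["Munich", "Madrid", "Madrid", "Madrid"],
        "Title": ["Story A", "Story A", "Story A", "Story B"],
    }
)

# Detect: mark every row whose Title occurs more than once.
is_dup = dfges.duplicated(subset=["Title"], keep=False)

# Delete by Title only: each article survives once, regardless of city.
by_title = dfges.drop_duplicates(subset=["Title"], keep="last")

# Delete by City and Title: each article survives once per city.
by_city_title = dfges.drop_duplicates(subset=["City", "Title"])

print(is_dup.tolist())
print(by_title)
print(by_city_title)
```

With subset=['Title'] the Munich entry for "Story A" is dropped as well, so if every city should keep its own copy of an article, subset=['City', 'Title'] may be the better fit.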