When I print dfges, I get duplicate rows for a given element: for Madrid, for example, every news item is displayed three times. Does anyone know a way to work around this problem or remove the duplicates? Is it possibly caused by the outer for loop?
import ssl
import smtplib
import pandas as pd
from GoogleNews import GoogleNews

ssl._create_default_https_context = ssl._create_unverified_context

googlenews = GoogleNews()
googlenews.set_encode('utf_8')
googlenews.set_lang('en')
googlenews.set_period('7d')

orte = ["Munich", "New York", "Madrid", "London", "Los Angeles", "Frankfurt", "Rom"]
nachrichten = []

for ort in orte:
    googlenews.clear()
    googlenews.get_news(ort)
    table_new = []
    for row in googlenews.results():
        table_new.append({
            'City': ort,
            'Title': row['title'],
            'Date': row['date'],
            'URL': row['link'],
            'Source': row['site'],
        })
    # one DataFrame per city, collected for the final concat
    df = pd.DataFrame(table_new)
    nachrichten.append(df)

dfges = pd.concat(nachrichten, axis='index')
print(dfges)
CodePudding user response:
Seems you need a dfges.drop_duplicates(inplace=True). This will remove all duplicate rows. Note that two rows are considered duplicates only if they have the same values in every field (except the index).
See the docs for details: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html
Provide an example of your DataFrame dfges if you need more help.
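As a minimal sketch of that behaviour (toy data standing in for dfges, not your actual frame):

import pandas as pd

# made-up rows for illustration only
df = pd.DataFrame({
    'City':  ['Madrid', 'Madrid', 'Madrid'],
    'Title': ['Same headline', 'Same headline', 'Same headline'],
    'URL':   ['https://a.example', 'https://a.example', 'https://b.example'],
})

# rows 0 and 1 match in every column, so one of them is dropped;
# row 2 survives because its URL differs
deduped = df.drop_duplicates()
print(deduped)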
CodePudding user response:
As articles can be tagged with multiple locations, e.g. "Munich" and "Madrid", and hence appear multiple times in your df, you should detect and delete duplicates using drop_duplicates(subset=...) based on 'Title' (and possibly more criteria); otherwise no duplicates are deleted at all, because the City column differs between the copies.
Your code could look like this:
...
dfges = pd.concat(nachrichten, axis='index')
dfges.drop_duplicates(subset=['Title'], keep='last', inplace=True)
print(dfges)
...
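If matching on the headline alone feels too aggressive (different outlets could plausibly run the same title), one option is to widen the subset. A possible variant of the snippet above; the choice of subset columns is an assumption, adjust it to your data:

...
dfges = pd.concat(nachrichten, axis='index')
# treat rows as duplicates only when both headline and link match;
# ignore_index=True renumbers the surviving rows from 0
dfges = dfges.drop_duplicates(subset=['Title', 'URL'], keep='last', ignore_index=True)
print(dfges)
...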