I am trying to scrape multiple news articles from URLs stored in a dataframe column that I convert into a list. But when I insert the results into a dataframe, it only keeps the last scraped value, even though print shows all of them. My example df looks like this:
df = {'data': ['https://www.liputan6.com/bisnis/read/4661489/erick-thohir-apresiasi-transformasi-digital-pos-indonesia',
               'https://ekonomi.bisnis.com/read/20211010/98/1452514/hari-pos-sedunia-pos-indonesia-kasih-diskon-70-persen-paket-kilat']}
This is my code:
import pandas as pd
import newspaper
from newspaper import Article

df = pd.read_excel(' 1.xlsx')
urls = df['data'].to_list()

for url in urls:
    try:
        a = Article(url, language='id')
        a.download()
        a.parse()
        author = a.authors
        dates = a.publish_date
        add_data = a.additional_data
        text = a.text
        tag = a.tags
        title = a.title
        keywords = a.keywords
        new_df = pd.DataFrame({'author': [author]})  # needs to be in [] because there can be multiple authors
        print(author, dates, add_data, text, tag, title, keywords)
    except Exception as e:
        print(e)
When I run print(author), it shows these results:
['S. Dian Andryanto', 'Reporter', 'Editor']
['Ali Akhmad Noor Hidayat', 'Reporter', 'Editor']
But when I insert them into the dataframe, only the last value is kept:
new_data = {"author":['Ali Akhmad Noor Hidayat', 'Reporter', 'Editor']}
Can anyone explain how to get all my authors inserted into the dataframe?
CodePudding user response:
You are looping through the urls list and, inside the loop, you re-create the entire DataFrame stored in new_df on every iteration, so each pass overwrites the previous one. To avoid this, collect the values in an external dictionary and create the DataFrame once, at the end of the loop, like in the following code:
import pandas as pd
import newspaper
from newspaper import Article

df = pd.read_excel(' 1.xlsx')
urls = df['data'].to_list()

all_authors = {"author": []}
for url in urls:
    try:
        a = Article(url, language='id')
        a.download()
        a.parse()
        author = a.authors
        dates = a.publish_date
        add_data = a.additional_data
        text = a.text
        tag = a.tags
        title = a.title
        keywords = a.keywords
        all_authors['author'].append(author)  # append one author list per article
    except Exception as e:
        print(e)

new_df = pd.DataFrame(data=all_authors)
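If you also want the other parsed fields in the final DataFrame, a common variant of the same idea is to collect one dict per article in a list and build the DataFrame once at the end. A minimal sketch, reusing urls and the imports from above; the column names are my own choice:

rows = []  # one dict per successfully parsed article
for url in urls:
    try:
        a = Article(url, language='id')
        a.download()
        a.parse()
        rows.append({'author': a.authors,      # list of author names
                     'date': a.publish_date,
                     'title': a.title,
                     'text': a.text})
    except Exception as e:
        print(e)

new_df = pd.DataFrame(rows)  # one row per article; columns: author, date, title, text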
CodePudding user response:
Collect each new_df in a list and concat them at the end. I slightly modified your code because catching all exceptions is a bad idea; use newspaper.ArticleException instead.
import pandas as pd
import newspaper
from newspaper import Article

urls = ['https://www.liputan6.com/bisnis/read/4661489/erick-thohir-apresiasi-transformasi-digital-pos-indonesia',
        'https://ekonomi.bisnis.com/read/20211010/98/1452514/hari-pos-sedunia-pos-indonesia-kasih-diskon-70-persen-paket-kilat']

data = []
for url in urls:
    try:
        a = Article(url, language='id')
        a.download()
        a.parse()
    except newspaper.ArticleException as e:
        print(e)
    else:
        author = a.authors
        dates = a.publish_date
        add_data = a.additional_data
        text = a.text
        tag = a.tags
        title = a.title
        keywords = a.keywords
        new_df = pd.DataFrame({'author': [author]})  # wrap in [] so the authors list stays in one cell
        data.append(new_df)

df = pd.concat(data)
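Note that pd.concat keeps each one-row DataFrame's original index, so the result ends up with repeated 0s. Passing ignore_index=True renumbers the rows:

df = pd.concat(data, ignore_index=True)  # index becomes 0..n-1 instead of 0, 0, ...
print(df)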