I am trying to scrape multiple news articles from URLs stored in a dataframe column that I convert into a list. But when I insert the results into a dataframe, it only keeps the last scraped value, even though print shows all of them. My example df looks like this:
df = {'data': ['https://www.liputan6.com/bisnis/read/4661489/erick-thohir-apresiasi-transformasi-digital-pos-indonesia',
               'https://ekonomi.bisnis.com/read/20211010/98/1452514/hari-pos-sedunia-pos-indonesia-kasih-diskon-70-persen-paket-kilat']}
This is my code:
import pandas as pd
import newspaper
from newspaper import Article

df = pd.read_excel(' 1.xlsx')
urls = df['data'].to_list()

for url in urls:
    try:
        a = Article(url, language='id')
        a.download()
        a.parse()
        author = a.authors
        dates = a.publish_date
        add_data = a.additional_data
        text = a.text
        tag = a.tags
        title = a.title
        keywords = a.keywords
        new_df = pd.DataFrame({'author': [author]})  # needs to be in [] because there can be multiple authors
        print(author, dates, add_data, text, tag, title, keywords)
    except Exception as e:
        print(e)
When I run print(author), it shows these results:
['S. Dian Andryanto', 'Reporter', 'Editor']
['Ali Akhmad Noor Hidayat', 'Reporter', 'Editor']
But when I insert them into the dataframe, only the last value is kept:
new_data = {"author":['Ali Akhmad Noor Hidayat', 'Reporter', 'Editor']}
Can anyone explain how to get all my authors inserted into the dataframe?
CodePudding user response:
You are looping through the urls list and, inside the loop, you re-create the entire DataFrame stored in new_df on every iteration, so each pass overwrites the previous one. To avoid this, collect the values in an external dictionary and create the DataFrame once, at the end of the loop, like in the following code:
import pandas as pd
import newspaper
from newspaper import Article

df = pd.read_excel(' 1.xlsx')
urls = df['data'].to_list()

all_authors = {"author": []}
for url in urls:
    try:
        a = Article(url, language='id')
        a.download()
        a.parse()
        author = a.authors
        dates = a.publish_date
        add_data = a.additional_data
        text = a.text
        tag = a.tags
        title = a.title
        keywords = a.keywords
        all_authors['author'].append(author)  # append one author list per article
    except Exception as e:
        print(e)

new_df = pd.DataFrame(data=all_authors)
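If you also want the other parsed fields in the final DataFrame, a common variant of the same idea is to collect one dict per article in a list and build the DataFrame once at the end. A minimal sketch, reusing urls and the imports from above; the column names are my own choice:

rows = []  # one dict per successfully parsed article
for url in urls:
    try:
        a = Article(url, language='id')
        a.download()
        a.parse()
        rows.append({'author': a.authors,      # list of author names
                     'date': a.publish_date,
                     'title': a.title,
                     'text': a.text})
    except Exception as e:
        print(e)

new_df = pd.DataFrame(rows)  # one row per article; columns: author, date, title, text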
CodePudding user response:
Collect each new_df in a list and concat them at the end. I slightly modified your code because catching all exceptions is a bad idea; use newspaper.ArticleException instead.
import pandas as pd
import newspaper
from newspaper import Article

urls = ['https://www.liputan6.com/bisnis/read/4661489/erick-thohir-apresiasi-transformasi-digital-pos-indonesia',
        'https://ekonomi.bisnis.com/read/20211010/98/1452514/hari-pos-sedunia-pos-indonesia-kasih-diskon-70-persen-paket-kilat']

data = []
for url in urls:
    try:
        a = Article(url, language='id')
        a.download()
        a.parse()
    except newspaper.ArticleException as e:
        print(e)
    else:
        author = a.authors
        dates = a.publish_date
        add_data = a.additional_data
        text = a.text
        tag = a.tags
        title = a.title
        keywords = a.keywords
        new_df = pd.DataFrame({'author': [author]})  # wrap in [] so the authors list stays in one cell
        data.append(new_df)

df = pd.concat(data)
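Note that pd.concat keeps each one-row DataFrame's original index, so the result ends up with repeated 0s. Passing ignore_index=True renumbers the rows:

df = pd.concat(data, ignore_index=True)  # index becomes 0..n-1 instead of 0, 0, ...
print(df)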