Home > Blockchain >  How to maintain terminal json object print behavior while pushing to Google Sheet?
How to maintain terminal json object print behavior while pushing to Google Sheet?

Time:09-16

So this sounds very specific, but I'm not sure what seems to cause the issue. I've tried adding in tweaks like "tostring" and "str" throughout my code, but no luck (maybe I'm just not putting in the right place)...

Basically, I'm trying to take this script I was helped with yesterday (which prints website data):

import requests
from bs4 import BeautifulSoup as bs
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'

r = requests.get(url, headers=headers)

soup = bs(r.text, 'html.parser')

page_obj = soup.select_one('script#__NEXT_DATA__')
json_obj = json.loads(page_obj.text)
print('Title:', json_obj['props']['pageProps']['story']['header']['headline'])
print('Date:', json_obj['props']['pageProps']['story']['date'])
print('Content:', json_obj['props']['pageProps']['story']['content'])

and alter it so the information is pushed to a Google Sheet.

I went through and created the credentials, etc, and verified the connection between Python and this sheet is active (just by adding values, not the results of my script).. the broken script (in it's current state) is below:

import requests
from bs4 import BeautifulSoup as bs
import json
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1NFrhsJT7T0zm3dRaP5J8OY0FryBHy5W_wEEGvwBg58I')
worksheet = sh.sheet1

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

url = 'https://www.nba.com/game/bkn-vs-phi-0022100993'

r = requests.get(url, headers=headers)

soup = bs(r.text, 'html.parser')

page_obj = soup.select_one('script#__NEXT_DATA__')
json_obj = json.loads(page_obj.text)
title = (json_obj['props']['pageProps']
         ['story']['header']['headline'])
date = (json_obj['props']['pageProps']['story']['date'])
content = (json_obj['props']['pageProps']['story']['content'])

AddData = [title, date, content]
worksheet.append_row(AddData)

While the first script runs so great (pushes clean results in the terminal like below): Clean Results

The same script (for the most part) alters the data, and leaves the Google Sheet BLANK (while like I mentioned it runs just adding in values)

This is what I'm seeing in the results here (many errors, and \n representing the line breaks, rather than the clean text).

Like I said, I thought maybe putting the variables as .string or .text would help - maybe I just wasn't doing it correctly.

Further testing - when I JUST print the DATE variable, it moves correctly to the sheet. Trying to figure out why the other variables won't.

Any thoughts are greatly appreciated!

CodePudding user response:

You can try this option. It worked for me

content = str(json_obj['props']['pageProps']['story']['content'])

CodePudding user response:

You can make dict then append with append_row(AddData) instead of list which is AddData = [title, date, content]

AddData={
    'title':title,
    'date':date,
    'content':content
    }
worksheet.append_row(AddData)

CodePudding user response:

In addition to jkk's answer. The reason it is throwing an error is because for some reason the Content being returned is enclosed in square brackets [ ] therefore causing some data or formatting issues. enter image description here

The workaround to this is you need to convert it to string, but the brackets on both ends and the apostrophe(') in the beginning will be included. You can either keep the result with the Brackets or remove them so you only have plain text. To remove them you need to use [2:-1], this removes the first 2 characters and the last character of the string.

Try to change:

content = (json_obj['props']['pageProps']['story']['content'])

To:

content = str(json_obj['props']['pageProps']['story']['content'])[2:-1]

Result: enter image description here

  • Related