Scraping financial statements from roic.ai

Time:10-17

Has anyone ever scraped the financial statements available at roic.ai (e.g. into a DataFrame)?

The page's HTML is deeply nested, so extracting the statements is not straightforward:

from gazpacho import get, Soup

ticker = 'aapl'
url = f'https://roic.ai/financials/{ticker}?fs=annual'
print(url)

html = get(url)
soup = Soup(html)

# attrs must be a dict; the original {'class', "flex-col"} was a set literal
soup.find('div', {'class': 'flex-col'})

CodePudding user response:

You can load the JSON data from the <script> element with id __NEXT_DATA__ embedded in the page:

import json

import pandas as pd
import requests
from bs4 import BeautifulSoup

ticker = "aapl"
url = f"https://roic.ai/financials/{ticker}?fs=annual"


soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").text)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

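# load sample data as a pandas DataFrame; judging by the output below,
# "bsq" holds the quarterly balance sheet rows
df = pd.DataFrame(data["props"]["pageProps"]["data"]["data"]["bsq"])
# to_markdown() needs the optional `tabulate` package
print(df.head().to_markdown(index=False))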

Prints:

date symbol reportedCurrency cik fillingDate acceptedDate calendarYear period cashAndCashEquivalents shortTermInvestments cashAndShortTermInvestments netReceivables inventory otherCurrentAssets totalCurrentAssets propertyPlantEquipmentNet goodwill intangibleAssets goodwillAndIntangibleAssets longTermInvestments taxAssets otherNonCurrentAssets totalNonCurrentAssets otherAssets totalAssets accountPayables shortTermDebt taxPayables deferredRevenue otherCurrentLiabilities totalCurrentLiabilities longTermDebt deferredRevenueNonCurrent deferredTaxLiabilitiesNonCurrent otherNonCurrentLiabilities totalNonCurrentLiabilities otherLiabilities capitalLeaseObligations totalLiabilities preferredStock commonStock retainedEarnings accumulatedOtherComprehensiveIncomeLoss othertotalStockholdersEquity totalStockholdersEquity totalLiabilitiesAndStockholdersEquity minorityInterest totalEquity totalLiabilitiesAndTotalEquity totalInvestments totalDebt netDebt link finalLink
06/25/2022 AAPL USD 0000320193 2022-07-29 2022-07-28 18:06:56 2022 Q3 27502000000 20729000000 48231000000 42242000000 5433000000 16386000000 112292000000 40335000000 0 0 0 131077000000 0 52605000000 224017000000 0 336309000000 48343000000 24991000000 0 7728000000 48811000000 129873000000 94700000000 0 0 53629000000 148329000000 0 0 278202000000 0 62115000000 5289000000 -9297000000 0 58107000000 336309000000 0 58107000000 336309000000 151806000000 119691000000 92189000000 https://www.sec.gov/Archives/edgar/data/320193/000032019322000070/0000320193-22-000070-index.htm https://www.sec.gov/Archives/edgar/data/320193/000032019322000070/aapl-20220625.htm
03/26/2022 AAPL USD 0000320193 2022-04-29 2022-04-28 18:03:58 2022 Q2 28098000000 23413000000 51511000000 45400000000 5460000000 15809000000 118180000000 39304000000 0 0 0 141219000000 0 51959000000 232482000000 0 350662000000 52682000000 16658000000 0 7920000000 50248000000 127508000000 103323000000 0 0 52432000000 155755000000 0 0 283263000000 0 61181000000 12712000000 -6494000000 0 67399000000 350662000000 0 67399000000 350662000000 164632000000 119981000000 91883000000 https://www.sec.gov/Archives/edgar/data/320193/000032019322000059/0000320193-22-000059-index.htm https://www.sec.gov/Archives/edgar/data/320193/000032019322000059/aapl-20220326.htm
12/25/2021 AAPL USD 0000320193 2022-01-28 2022-01-27 18:00:58 2022 Q1 37119000000 26794000000 63913000000 65253000000 5876000000 18112000000 153154000000 39245000000 0 0 0 138683000000 0 50109000000 228037000000 0 381191000000 74362000000 16169000000 41241000000 7876000000 49167000000 147574000000 106629000000 0 0 55056000000 161685000000 0 0 309259000000 0 58424000000 14435000000 -927000000 0 71932000000 381191000000 0 71932000000 381191000000 165477000000 122798000000 85679000000 https://www.sec.gov/Archives/edgar/data/320193/000032019322000007/0000320193-22-000007-index.htm https://www.sec.gov/Archives/edgar/data/320193/000032019322000007/aapl-20211225.htm
09/25/2021 AAPL USD 0000320193 2021-10-29 2021-10-28 18:04:28 2021 Q4 34940000000 27699000000 62639000000 51506000000 6580000000 14111000000 134836000000 39440000000 0 0 0 127877000000 0 48849000000 216166000000 0 351002000000 54763000000 15613000000 0 7612000000 47493000000 125481000000 109106000000 0 0 53325000000 162431000000 0 0 287912000000 0 57365000000 5562000000 163000000 0 63090000000 351002000000 0 63090000000 351002000000 155576000000 124719000000 89779000000 https://www.sec.gov/Archives/edgar/data/320193/000032019321000105/0000320193-21-000105-index.htm https://www.sec.gov/Archives/edgar/data/320193/000032019321000105/aapl-20210925.htm
06/26/2021 AAPL USD 0000320193 2021-07-28 2021-07-27 18:03:42 2021 Q3 34050000000 27646000000 61696000000 33908000000 5178000000 13641000000 114423000000 38615000000 0 0 0 131948000000 0 44854000000 215417000000 0 329840000000 40409000000 16039000000 0 7681000000 43625000000 107754000000 105752000000 0 0 52054000000 157806000000 0 0 265560000000 0 54989000000 9233000000 58000000 0 64280000000 329840000000 0 64280000000 329840000000 159594000000 121791000000 87741000000 https://www.sec.gov/Archives/edgar/data/320193/000032019321000065/0000320193-21-000065-index.htm https://www.sec.gov/Archives/edgar/data/320193/000032019321000065/aapl-20210626.htm
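
The extraction step itself can be tried without network access. The following is only a sketch using the standard library: the HTML string is a made-up miniature of the page (not roic.ai's actual markup), and NextDataParser and extract_next_data are hypothetical helper names, not part of any library.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Script content may arrive in several pieces, so collect them all.
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html):
    """Return the parsed JSON payload, or raise ValueError if it is absent."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    if not parser.chunks:
        raise ValueError("no __NEXT_DATA__ script found")
    return json.loads("".join(parser.chunks))


# Made-up sample: one row with three of the columns shown in the output above.
SAMPLE_HTML = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"data": {"data": {"bsq": [
  {"date": "06/25/2022", "symbol": "AAPL", "totalAssets": 336309000000}
]}}}}}
</script>
</body></html>
"""

data = extract_next_data(SAMPLE_HTML)
rows = data["props"]["pageProps"]["data"]["data"]["bsq"]
print(rows[0]["symbol"], rows[0]["totalAssets"])
```

With the real page, the string passed to extract_next_data would be the response body from requests.get(url).text; the list in rows can then go straight into pd.DataFrame as in the answer above.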

CodePudding user response:

from gazpacho import Soup
import json
import pandas as pd

ticker = 'aapl'
url = f'https://roic.ai/financials/{ticker}?fs=annual'
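# Soup.get fetches the URL and parses the response in one step
soup = Soup.get(url)
# the statements live as JSON inside <script id="__NEXT_DATA__">
scraped_data = soup.find('script', {'id': "__NEXT_DATA__"})
data = json.loads(scraped_data.text)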
df = pd.DataFrame(data["props"]["pageProps"]["data"]["data"]["bsq"])
print(df.head())

The same approach works with gazpacho, as in the code above. Don't forget to import the pandas and json libraries.
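
Both answers index the same chain of keys, which raises a KeyError if any level is missing (for example for an unknown ticker, or if the site changes its payload). A minimal sketch of a more forgiving lookup follows; dig and BSQ_PATH are hypothetical names introduced here, and the payload is a made-up stand-in for the parsed JSON.

```python
def dig(obj, path, default=None):
    """Follow a sequence of dict keys; return `default` if any key is absent."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


# Key chain used by both answers to reach the balance sheet rows.
BSQ_PATH = ("props", "pageProps", "data", "data", "bsq")

# Made-up stand-in for json.loads(...) of the __NEXT_DATA__ script.
payload = {"props": {"pageProps": {"data": {"data": {"bsq": [{"symbol": "AAPL"}]}}}}}

rows = dig(payload, BSQ_PATH, default=[])
missing = dig({"props": {}}, BSQ_PATH, default=[])
print(len(rows), len(missing))  # prints: 1 0
```

An empty list can still be passed to pd.DataFrame, which yields an empty frame instead of an exception.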
