I have a Python script for extracting some content. It loads URLs from a CSV file and writes the results to another CSV. Some of the pages have a div whose class holds unformatted text, and scraping that is proving difficult. How can I tweak my code to capture it? The unformatted text is not on every page, so I have added an error-handling statement.
Also, is there a way to put the unformatted text in the same column as Content rather than in its own column?
urls = [
    'https://www.studypool.com/discuss/18233577/obtain-a-copy-of-the-financial-statements-for-a-publicly-traded-company-then-complete-a-ratio-analysis',
    'https://www.studypool.com/discuss/18898929/financial-accounting-questions-multiple-choice-about-the-chapter-cash-amp-investments',
    'https://www.studypool.com/discuss/18237517/compare-forms-of-fundamental-and-technical-analyses'
]
def transform(url):
    r = requests.get(str(url))
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.find('h1',{'class':"question-title"})
    content = soup.find('div',{'class':'user-generated-description'})
    textbox = soup.find('div', {'class':'unformatted-text-box'})
    try:
        textbox = textbox.find('a',{'rel':'unformatted-text-box'}).text.strip()
    except:
        textbox = ''
    row = {'Title':title.text,
           'Content':content.text,
           'Textbox':textbox}
CodePudding user response:
This is one way of achieving your goal:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from tqdm import tqdm

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
urls = [
    'https://www.studypool.com/discuss/18233577/obtain-a-copy-of-the-financial-statements-for-a-publicly-traded-company-then-complete-a-ratio-analysis',
    'https://www.studypool.com/discuss/18898929/financial-accounting-questions-multiple-choice-about-the-chapter-cash-amp-investments',
    'https://www.studypool.com/discuss/18237517/compare-forms-of-fundamental-and-technical-analyses'
]
big_list = []
# Reuse one session so the headers are sent with every request
s = requests.Session()
s.headers.update(headers)
for url in tqdm(urls):
    r = s.get(url)
    soup = bs(r.text, 'html.parser')
    title = soup.select_one('h1.question-title').get_text(strip=True)
    content = soup.select_one('div.user-generated-description').text.strip()
    try:
        # Take the text of the whole div rather than a nested <a> tag
        textbox = soup.select_one('div.unformatted-text-box').text.strip()
    except AttributeError:
        # select_one returns None when the page has no such div
        textbox = 'not specified'
    # Append the text box to the content so both end up in the same column
    big_list.append((title, content + '\n' + textbox))
df = pd.DataFrame(big_list, columns = ['Title', 'Content'])
df.to_csv('saved_data.csv')
print(df)
Result printed in terminal:
Title Content
0 University of Illinois at Chicago Accurate Reporting of Social Media Use Discussion This is an assignment on ratio analysis. You need to obtain a copy of the financial statements for a publicly traded company. Then choose as many ratios as possible. (I suggest choosing 4-5, the professor's requirement is at least three) I will send you the specific requirements as an attachment.\nFinancial Statement Analysis\nNew Focus Consulting, 2014\nChapter 13\nRatios and Trend Analysis\nChapter 13\nHorizontal Analysis: source, value investing basics\nChapter 13\nCalculate Income Statement 2013 vertical\namounts and 2011 to 2013 horizontal\namounts.\nChapter 13\nCurrent Ratio: Used to determine a company’s\nability to repay short-term debts.\nCurrent Assets\nCurrent Liabilities\nChapter 13\nQuick Ratio: Addressed liquidity by using cash\nand current assets that can be most quickly\nconverted to cash(quick assets).\nQuick Assets\nCurrent Liabilities\nChapter 13\nInventory Turnover Ratio: Number of times the inventory of\na company is sold and replaced over a specified period\nof time.\nCost of Goods Sold\nAverage Inventory at Cost\nChapter 13\nAccounts Receivable Turnover Ratio: Calculates\nhow quickly a company turns it credit sales into\ncash.\nCredit Sales\nAverage Accounts Receivable\nChapter 13\nAverage Collection Period Ratio: The\naverage number of days it takes for a\ncompany to collect its accounts receivable.\nAvg. Accounts Receivable\n(Sales/360)\nChapter 13\nDebt to Equity Ratio: Calculates the amount\nof debt as a percentage of equity. Some\nanalysts will use total liabilities as debt.\nTotal Debt\nTotal Equity\nChapter 13\nGross Profit Margin Ratio: Determines the\nprofitability of a company through direct\nexpenses. 
Used to evaluate efficiency of\noperations.\nSales – Cost of Goods Sold\nSales\nChapter 13\nOperating Margin Ratio: Determines the\nprofitability percentage from a company’s\noperations.\nOperating Income\nSales\nChapter 13\nNet Profit Margin Ratio: Determines the profit\nof a company after it meets the obligations\nfor a specific period.\nNet Profit\nSales\nChapter 13\nReturn on Equity Ratio: Indicates the return\nearned by the owners(investors) for a\nperiod.\nNet Profit\nAverage Owners Equity\nChapter 13\nEarnings Per Share Ratio: The theoretical\nearnings per each outstanding share.\nNet Income – Preferred Dividends\nAverage Number of Common\nShares Outstanding\nChapter 13\nThe prior ratios were some examples of\nratios and analysis. There are a number\nmore. Some not presented were ratios\nusing assets as a denominator. In my\nopinion, they are less telling than other\nratios.\nNew Focus Consulting\nFinancial Statement & Ratio Assignment\nObtain a copy of the financial statements for a publicly traded company.\nSelect three of the ratios presented in class or from Financial Statement Analysis\nand show the calculations for your selected company.\nCALCULATE FOR AT LEAST THE LAST THRE YEARS. ONE OF THE\nYEARS MUST BE DURING THE YEAR ENDED IN 2018.\nRemember, ratios are most relevant when compared to a companies' own\nhistorical, industry or competitors trends. For the above calculations, what\nstory do they tell? Provide an explanation for each of the three ratios presented.\nThe assignment will be at least two pages, not more than four pages.\nNote: Apple Inc, Samsung or Tesla are not allowed to be used for this assignment.\nNew Focus Consulting\n2007\nNew Focus Consulting\nFinancial Indicators & Ratios\nUsed to understand trends of a company. Most useful when compared to\na company's historical information or industry average.\nAccounts Receivable Turnover: Net credit sales over average accounts receivable. 
Measures\nhow quickly customers pay their bills.\nCapitalization Rate: Calculated as net income over owners investment, and\n(Cap Rate) reflects the rate of return a property will produce on an\ninvestment.\nCash Debt Coverage Ratio: Net cash from operating activities over total liabilities.\nMeasures a company's ability to repay its liabilities from cash\ngenerated from operations without liquidating assets.\nCost/Income Ratio: Total expenses divided by total expenses.\nCurrent Ratio: Current assets over current liabilities. Used by lending\ninstitutions to determine a company's ability to repay\nshort-term debts.\nDebt Coverage Ratio: Net income of an investment over the debt service of the\ninvestment.\nDebt to Equity Ratio: Total debt(longterm and shortterm) over total equity. Lending\ninstitutions will usuall be concerned with a companies\nDebt to Equity ratio over .5 to .75.\nDividend Yield Ratio: Annual dividends over current market share price of stock.\nLong Term Debt to Equity Ratio: Long term debt over owner's equity. In general, a zero to .3\nNew Focus a\nConsulting\nratio is considered\nrelatively low debt exposure.\n2006\nOperating Ratio: Operating revenues over operating expenses. When\ncompared to other periods or industry averages, helps\nmeasure a company's operating efficiency.\nPrice/Earnings Ratio: Current price of a stock divided by actual earning per share.\n(P/E Ratio)\nReturn on Investment: Net Income divided by net book value(total assets minus\n(ROI) intangible assets and liabilities).\nNew Focus Consulting\n2006\n\nPurchase answer to see full\nattachment
1 Financial Accounting Cash & Investments Multiple Choice Questions 1)The following information regarding the cash activities of Roves Ltd. for the month of April 20x5 is given below:Cash balance per books, April 12522
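The step that merges the text box into the Content column can also be separated from the scraping, which makes it easy to check without network access. This is only a minimal sketch using the standard library: `merge_content` and `rows_to_csv` are hypothetical helper names, the sample values are made up, and it assumes a missing text box should add nothing to Content (the code above writes 'not specified' instead).

```python
import csv
import io


def merge_content(content, textbox, sep="\n"):
    """Join the main content and the optional text box into one string.

    Hypothetical helper: missing or empty parts (None, "" or whitespace)
    are skipped, so pages without a text box yield just the content.
    """
    parts = [p.strip() for p in (content, textbox) if p and p.strip()]
    return sep.join(parts)


def rows_to_csv(rows):
    """Return CSV text for (title, content) rows, preceded by a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Title", "Content"])
    writer.writerows(rows)
    return buf.getvalue()


# Illustrative values only, not scraped from the real pages.
rows = [
    ("With box", merge_content("Main text", "Extra text")),
    ("Without box", merge_content("Main text", None)),
]
csv_text = rows_to_csv(rows)
```

In the scraping loop, `soup.select_one(...)` returns `None` when the div is absent, so the helper could be called as `merge_content(content, box.get_text(strip=True) if box else None)` in place of the try/except.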