Lambda Function over dataframe not working


Hi, I have a DataFrame with a column (Source) that just contains different PDF links, and I wrote a lambda function that is supposed to take each link, extract the text of that PDF, and return it in another column called Content. This is the line of code I used:

result = result.assign(Content = lambda x: ( urltotext(x['Source']) ))

Here's my whole code

import requests
import pandas as pd
from datetime import datetime
from datetime import date
import json
import urllib.request
import PyPDF2
import fitz
import multiprocessing
import requests_cache 


def urltotext(link):
    try:
        req = urllib.request.urlopen(link)
        file = open("DailyCA.pdf",'wb')
        file.write(req.read())
        file.close()
        doc = fitz.open("DailyCA.pdf") 
        text = []
        for page in doc:
            temptext = page.get_text('text')
            text.append(temptext)
        text = 'shodhpage'.join(map(str, text))
        doc.close()
        return text
    except Exception as e:
        text = "none"
        return text


def all():
    print("Started Pulling")
    currentd = date.today()
    s = requests_cache.CachedSession('demo_cache', backend='sqlite')
    headers =   {'Host':'www.nseindia.com','User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0','Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Language':'en-US,en;q=0.5', 'Accept-Encoding':'gzip, deflate, br','DNT':'1', 'Connection':'keep-alive', 'Upgrade-Insecure-Requests':'1','Pragma':'no-cache','Cache-Control':'no-cache',  }
    url = 'https://www.nseindia.com/'
    step = s.get(url,headers=headers)
    today = datetime.now().strftime('%d-%m-%Y')
    api_url = f'https://www.nseindia.com/api/corporate-announcements?index=equities&from_date=01-01-2022&to_date=18-08-2022'
    resp = s.get(api_url,headers=headers).json()
    print("API Read")
    result = pd.DataFrame(resp)
    result.drop(['difference', 'dt','exchdisstime','csvName','old_new','orgid','seq_id','bflag','symbol','sort_date'], axis = 1, inplace = True)
    result.rename(columns = {'an_dt':'DateandTime', 'attchmntFile':'Source','attchmntText':'Topic','desc':'Type','smIndustry':'Sector','sm_name':'Company Name','sm_isin':'ISIN'}, inplace = True)
    result[['Date','Time']] = result.DateandTime.str.split(expand=True)
    result = result[result['Type'].str.contains("Loss of Share Certificates|Copy of Newspaper Publication") == False]
    result['Type'] = result['Type'].astype(str)
    result['Type'].replace("Certificate under SEBI (Depositories and Participants) Regulations, 2018",'Junk' , inplace = True)
    result = result[result['Type'].str.contains("Junk") == False]
    result = result[result["Type"].str.contains("Trading Window") == False]
    result = result[result["Type"].str.contains("Loss of share certificate") == False]
    result = result[result["Type"].str.contains("Loss of share certificates") == False]
    result = result[result["Type"].str.contains("Disclosure under SEBI Takeover Regulations") == False]
    result = result[result["Type"].str.contains("Newspaper Advertisements") == False]
    result = result[result["Type"].str.contains("-") == False]
    result.drop_duplicates(subset='Source', keep = 'first', inplace = True)
    result['Temporary']=pd.to_datetime(result['Date'] + ' ' + result['Time'])
    result['Date']=result['Temporary'].dt.strftime('%b %d, %Y')
    result['Time']=result['Temporary'].dt.strftime('%R %p')
    result['DateTime'] = pd.to_datetime(result['Temporary'])
    result['DateTime'] = result['Temporary'].dt.strftime('%m/%d/%Y %I:%M %p')
    result.drop(['DateandTime', 'Temporary'], axis = 1, inplace = True)
    result = result.assign(Content = lambda x: ( urltotext(x['Source']) ))
    result.to_csv("2018-Test.csv")

all()

And here's what some sample rows and columns of the data I am using look like:

,Type,Source,Company Name,ISIN,Sector,Topic,Date,Time,DateTime,Equity,NSE
0,Updates,https://archives.nseindia.com/corporate/MOIL_01012019221502_Letter_SE_01012019_Change_Price_MnOre_160.pdf,MOIL Limited,INE490G01020,Metals,MOIL Limited has informed the Exchange regarding 'Fixation of prices of different grades Manganese Ore for 4th Quarter 2018-19 (January-March 2019) effective from 01.01.2019'.,"Jan 01, 2019",22:16 PM,01/01/2019 10:16 PM,yes,yes
1,Appointment,https://archives.nseindia.com/corporate/ICICIPRULI_01012019210925_SE_Intimation_appointment_of_director_01_01_2019_159.pdf,ICICI Prudential Life Insurance Company Limited,INE726G01019,,"ICICI Prudential Life Insurance Company Limited has informed the Exchange regarding Appointment of Ms Vibha Paul Rishi as Non- Executive Independent Director of the company w.e.f. January 01, 2019.","Jan 01, 2019",21:27 PM,01/01/2019 09:27 PM,yes,yes
2,Cessation,https://archives.nseindia.com/corporate/SUULD_01012019203402_IntimationLetter_158.pdf,Suumaya Industries Limited,INE591Q01016,,"Suumaya Lifestyle Limited has informed the Exchange regarding Cessation of Ms Priya Gandhi as Company Secretary & Compliance Officer of the company w.e.f. November 16, 2018.","Jan 01, 2019",20:35 PM,01/01/2019 08:35 PM,yes,yes
3,Updates,https://archives.nseindia.com/corporate/COALINDIA_01012019194358_01012019193112_156.pdf,Coal India Limited,INE522F01014,Mining,Coal India Limited has informed the Exchange regarding 'Provisional Production and offtake performance of CIL and its Subsidiary Companies for the month of Dec 18 and for the period Apr18 to Dec 18'.,"Jan 01, 2019",19:44 PM,01/01/2019 07:44 PM,yes,yes

CodePudding user response:

I believe that your problem is that urllib.request.urlopen expects a string (or a Request object), but inside assign the lambda receives the whole DataFrame, so x['Source'] is a pd.Series rather than a single link. You can either keep assign and apply the function to each element of the Series inside the lambda:

result = result.assign(Content = lambda x: x['Source'].map(urltotext))

or, for better readability, use the map function on the column directly:

result["Content"] = result["Source"].map(urltotext)