The code uses OCR to read text from the image URLs in the list 'url_list'. I am trying to append the output, the string 'txt', to an initially empty pandas column 'url_text'. However, nothing gets appended to the column 'url_text'.
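import cv2
import pandas as pd
import pytesseract
from skimage import io # assumed source of io.imread, which can read an image from a URL
df = pd.read_csv(r'path') # main dataframe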
df['url_text'] = "" # create empty column that will later contain the text of the url_image
url_list = (df.iloc[:, 5]).tolist() # convert column with urls to a list
print(url_list)
['https://pbs.twimg.com/media/ExwMPFDUYAEHKn0.jpg',
'https://pbs.twimg.com/media/ExuBd4-WQAMgTTR.jpg',
'https://pbs.twimg.com/media/ExuBd5BXMAU2-p_.jpg',
' ',
'https://pbs.twimg.com/media/Ext0Np0WYAEUBXy.jpg',
'https://pbs.twimg.com/media/ExsJrOtWUAMgVxk.jpg',
'https://pbs.twimg.com/media/ExrGetoWUAEhOt0.jpg',
' ',
' ']
for img_url in url_list: # loop over all urls in list url_list
    try:
        img = io.imread(img_url) # convert image/url to cv2/numpy.ndarray format
        # Preprocessing of image
        gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        (h, w) = gry.shape[:2]
        gry = cv2.resize(gry, (w*3, h*3))
        thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
        txt = pytesseract.image_to_string(thr) # read tweet image text
        df['url_text'].append(txt)
        print(txt)
    except: # ignore any errors. Some of the rows do not contain a URL, causing the loop to fail
        pass
print(df)
CodePudding user response:
I couldn't test this, but please try the following. You may need to build the list first and then add it to the df as a new column (here I convert the list to a DataFrame and concatenate it with the original df).
txtlst = []
for img_url in url_list: # loop over all urls in list url_list
    try:
        img = io.imread(img_url) # convert image/url to cv2/numpy.ndarray format
        # Preprocessing of image
        gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        (h, w) = gry.shape[:2]
        gry = cv2.resize(gry, (w*3, h*3))
        thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
        txt = pytesseract.image_to_string(thr) # read tweet image text
        txtlst.append(txt)
        print(txt)
    except Exception: # some rows do not contain a URL; store an empty string so the list stays aligned with the rows
        txtlst.append("")
dftxt = pd.DataFrame({"url_text": txtlst})
df = pd.concat([df, dftxt], axis=1)
print(df)
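One thing to watch with the concat step: the question already creates an empty 'url_text' column, and pd.concat with axis=1 aligns rows on index labels rather than position. The sketch below uses made-up data (the 'url' column, the index labels 10 and 11, and the strings in texts are all assumptions for illustration) to show both effects, and that assigning the list directly avoids them. A frame that comes straight from read_csv has a default index, so only the duplicate column name would apply there.

```python
import pandas as pd

# Made-up frame with a non-default index, as you might get after filtering rows.
df = pd.DataFrame({"url": ["a.jpg", "b.jpg"]}, index=[10, 11])
df["url_text"] = ""          # placeholder column, as in the question
texts = ["first", "second"]  # stand-ins for the OCR results

# concat keeps both 'url_text' columns and aligns on index labels,
# so the labels 10, 11 and 0, 1 end up as four separate rows.
combined = pd.concat([df, pd.DataFrame({"url_text": texts})], axis=1)
print(combined.columns.tolist())
print(len(combined))

# Assigning the list is positional and overwrites the placeholder column.
df["url_text"] = texts
print(df)
```

If you prefer to keep the concat, dropping the placeholder column first and building dftxt with index=df.index should give the same result as the direct assignment.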
CodePudding user response:
As noted in the documentation for Series.append(), the append call only works between two Series, and it returns a new Series instead of modifying the column in place. (Series.append() was later deprecated and then removed in pandas 2.0.)
A better approach is to create an empty list outside the loop, append each string to that list inside the loop, and then assign the list to the column in one step: df["url_list"] = url_list. This is also much faster at runtime than repeatedly appending two Series together.
url_list = []
for ...:
    ...
    url_list.append(url_text)
df["url_list"] = url_list
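To make the pattern above concrete, here is a minimal runnable sketch. It does no downloading or OCR: read_text is a hypothetical stand-in for that step, and the 'url' column and its values are made up. The first part shows why the original loop appeared to do nothing: passing a plain string to Series.append raises, and a bare except hides the error (the exact exception type depends on the pandas version).

```python
import pandas as pd


def read_text(url: str) -> str:
    """Hypothetical stand-in for the download + OCR step in the question."""
    if not url.strip():
        raise ValueError("blank URL")
    return f"text from {url}"


df = pd.DataFrame({"url": ["a.jpg", " ", "b.jpg"]})  # made-up data
df["url_text"] = ""

# Appending a plain string to the column raises; `except: pass` would swallow it.
caught = None
try:
    df["url_text"].append("some text")
except Exception as exc:  # exact type depends on the pandas version
    caught = exc
print(type(caught).__name__)

# Collect one result per row in a plain list, then assign it once.
texts = []
for url in df["url"]:
    try:
        texts.append(read_text(url))
    except ValueError:
        texts.append("")  # keep one entry per row so the lengths match

df["url_text"] = texts
print(df)
```

Catching a specific exception and printing it, at least while debugging, makes this kind of silent failure much easier to spot than except: pass.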