Append string to empty pandas column in a for loop

Time:12-31

The code uses OCR to read text from the image URLs in the list 'url_list'. I am trying to append the output, a string 'txt', to an empty pandas column 'url_text'. However, the code does not append anything to the column 'url_text'.

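import cv2
import pandas as pd
import pytesseract
from skimage import io  # assumption: `io` in the loop below is skimage.io

df = pd.read_csv(r'path') # main dataframe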

df['url_text'] = "" # create empty column that will later contain the text of the url_image
url_list = (df.iloc[:, 5]).tolist() # convert column with urls to a list 

print(url_list)

['https://pbs.twimg.com/media/ExwMPFDUYAEHKn0.jpg', 
'https://pbs.twimg.com/media/ExuBd4-WQAMgTTR.jpg', 
'https://pbs.twimg.com/media/ExuBd5BXMAU2-p_.jpg', 
' ',
'https://pbs.twimg.com/media/Ext0Np0WYAEUBXy.jpg', 
'https://pbs.twimg.com/media/ExsJrOtWUAMgVxk.jpg', 
'https://pbs.twimg.com/media/ExrGetoWUAEhOt0.jpg',
' ',
' ']
for img_url in url_list: # loop over all urls in list url_list
    try:
        img = io.imread(img_url) # convert image/url to cv2/numpy.ndarray format

        # Preprocessing of image
        gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        (h, w) = gry.shape[:2]
        gry = cv2.resize(gry, (w*3, h*3))
        thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

        txt = pytesseract.image_to_string(thr)  # read tweet image text

        df['url_text'].append(txt)

        print(txt)
    except: # ignore any errors. Some rows do not contain a URL, which causes the loop to fail
        pass

print(df)

CodePudding user response:

I couldn't test this, but please try it: build the list first, then add it to the df as a new column. (Here the list itself is converted to a dataframe and then concatenated with the original df.)

txtlst=[]
for img_url in url_list: # loop over all urls in list url_list
    try:
        img = io.imread(img_url) # convert image/url to cv2/numpy.ndarray format

        # Preprocessing of image
        gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        (h, w) = gry.shape[:2]
        gry = cv2.resize(gry, (w*3, h*3))
        thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

        txt = pytesseract.image_to_string(thr)  # read tweet image text
        txtlst.append(txt)


        print(txt)
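    except Exception: # some rows do not contain a URL, so reading the image fails
        txtlst.append("") # placeholder keeps txtlst aligned with the rows of df
# skip the earlier df['url_text'] = "" line, otherwise concat adds a second column with the same name
dftxt = pd.DataFrame({"url_text": txtlst})
df = pd.concat([df, dftxt], axis=1)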
print(df)
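
One caveat with the concat approach: pd.concat with axis=1 matches rows by index label, not by position. The sketch below uses a hypothetical three-row frame with a non-default index and made-up OCR results, and passes index=df.index so the new column lines up with the existing rows. The data and the name `result` are illustrative only.

```python
import pandas as pd

# Hypothetical frame with a non-default index, e.g. after filtering rows.
df = pd.DataFrame({"url": ["a.jpg", " ", "b.jpg"]}, index=[10, 11, 12])
txtlst = ["first text", "", "second text"]  # made-up OCR results, one per row

# Reuse df's index so concat matches rows 10, 11, 12 instead of 0, 1, 2.
dftxt = pd.DataFrame({"url_text": txtlst}, index=df.index)
result = pd.concat([df, dftxt], axis=1)
print(result)
```

If df still has the default 0..n-1 index straight from read_csv, the index argument makes no difference.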

CodePudding user response:

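As noted in the documentation for Series.append(), the append call only concatenates one Series with another Series (or a list of them), and it returns a new Series instead of modifying the column in place. Passing a plain string raises an exception, which the bare except in the question silently discards, so the column stays empty. Note also that Series.append() was removed in pandas 2.0.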
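
To see why nothing ends up in the column, here is a minimal sketch with a hypothetical two-row frame. The append call raises (as far as I know a TypeError on older pandas, and an AttributeError on versions where Series.append no longer exists), and a catch-all except like the one in the question hides it.

```python
import pandas as pd

# Hypothetical stand-in for the question's dataframe.
df = pd.DataFrame({"url": ["a.jpg", "b.jpg"]})
df["url_text"] = ""

caught = None
try:
    # Same call as in the question: a plain string is not a Series.
    df["url_text"].append("some OCR text")
except Exception as exc:  # the question used a bare `except: pass` here
    caught = exc

print(type(caught).__name__)    # exception type depends on the pandas version
print(df["url_text"].tolist())  # the column is unchanged
```

Printing or logging the exception inside the except block would have surfaced the problem right away.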

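It is better to create an empty list outside the loop, append each string to it inside the loop, and then assign the list to the column with df["url_text"] = url_texts. This is also much faster at runtime than repeatedly appending one Series to another.

url_texts = []  # one string per row, in the same order as url_list

for img_url in url_list:
    try:
        url_text = pytesseract.image_to_string(io.imread(img_url))
    except Exception:
        url_text = ""  # placeholder keeps the list the same length as df
    url_texts.append(url_text)

df["url_text"] = url_texts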
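
If you would rather fill the column row by row, which is closer to what the question attempts, one option not mentioned in the answers is df.at, which sets a single cell by row label and column name. The sketch below is self-contained: `fake_ocr` is a hypothetical stand-in for the io.imread and pytesseract steps so that it runs without network access, and the URLs are made up.

```python
import pandas as pd

def fake_ocr(url):
    """Hypothetical stand-in for the io.imread + pytesseract steps."""
    if not url.strip():
        raise ValueError("row has no URL")
    return "text from " + url

df = pd.DataFrame(
    {"url": ["https://example.com/a.jpg", " ", "https://example.com/b.jpg"]}
)
df["url_text"] = ""  # empty column, as in the question

for idx, img_url in df["url"].items():
    try:
        df.at[idx, "url_text"] = fake_ocr(img_url)  # write one cell in place
    except ValueError:
        pass  # rows without a URL keep the empty-string default

print(df["url_text"].tolist())
```

For large frames the list approach is likely still the faster of the two, since it assigns the whole column once.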