Python 3.10.2 with sqlite3 on Windows 10. Relatively new to Python, but I was quite experienced with Perl around 1997-2005.
Banged my head against this for three days. Attacked it from lots of angles. No success. I'm asking for guidance because I don't see myself progressing on my own at this point without help flipping on the light switch.
My overall application uses PRAW to scrape a subreddit and put the Submissions into a table. Then we go through that reddit table one row at a time, scanning the selftext column for any URLs. Disregard some. Clean up others. Insert the ones we want into another table. At a later point, I'll then go through that table, downloading each URL.
My problem is that if I run the below code with the INSERT commented out, my print(dl_lnk[0]) line prints the results of all (currently) 1,400 rows from the reddit table. But if I activate the INSERT line, it only seems to process the first row in the table. I assume this is the case because the print line only shows a few lines, all regarding the same user and the same/similar URL.
I don't understand why this is. I don't think it's an error in the SQL (though there seem to be fewer options for catching SQL exceptions in code than I was used to in Perl). But I also don't see anything in my flow logic that would make it process just one row when the SQL INSERT happens, yet process all of them when it's commented out.
def scanForDownloadLinks(db):
    cursorObj = db.cursor()
    # Build the URL-matching regex once, rather than recompiling it for every row.
    link_regex = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
    # Get all records from the 'reddit' submission table.
    for reddit_row in cursorObj.execute('''SELECT name, author, subreddit, permalink, title, selftext FROM reddit ORDER BY created_utc ASC'''):
        # Find all URLs in this reddit submission's selftext column.
        dl_links = link_regex.findall(reddit_row[5])
        # Go through each link, disregard links we don't want, clean up links we do,
        # then insert them into the dl_ledger table.
        for dl_lnk in dl_links:
            print(dl_lnk[0])
            cursorObj.execute('''INSERT INTO dl_ledger (name, title, dl_link) VALUES (?,?,?)''', (reddit_row[0], str(reddit_row[4]), dl_lnk[0]))
    db.commit()
CodePudding user response:
You should use different cursors for the SELECT and INSERT queries. When you use the same cursor, the INSERT resets the cursor, so you can't fetch the remaining rows.
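A minimal sketch of the two-cursor version, assuming the question's schema and regex (the reddit and dl_ledger tables, and db being a sqlite3 connection):

import re

def scanForDownloadLinks(db):
    read_cur = db.cursor()   # used only for the SELECT
    write_cur = db.cursor()  # separate cursor for the INSERTs
    link_regex = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
    for reddit_row in read_cur.execute('SELECT name, author, subreddit, permalink, title, selftext FROM reddit ORDER BY created_utc ASC'):
        for dl_lnk in link_regex.findall(reddit_row[5]):
            # Executing on write_cur leaves read_cur's pending result set untouched.
            write_cur.execute('INSERT INTO dl_ledger (name, title, dl_link) VALUES (?,?,?)',
                              (reddit_row[0], str(reddit_row[4]), dl_lnk[0]))
    db.commit()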
Or you could use cursorObj.fetchall() to get all the results of the SELECT query as a list and loop through that, rather than looping through the cursor itself. If there are lots of rows this uses lots of memory, whereas looping through the cursor is incremental; but 1,400 rows shouldn't be a problem.
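A sketch of the fetchall() variant under the same assumptions:

import re

def scanForDownloadLinks(db):
    cursorObj = db.cursor()
    link_regex = re.compile('((https?):((//)|(\\\\))+([\w\d:#@%/;$()~_?\+-=\\\.&](#!)?)*)', re.DOTALL)
    cursorObj.execute('SELECT name, author, subreddit, permalink, title, selftext FROM reddit ORDER BY created_utc ASC')
    rows = cursorObj.fetchall()  # the whole result set is now an in-memory list
    for reddit_row in rows:
        for dl_lnk in link_regex.findall(reddit_row[5]):
            # Reusing cursorObj is safe here: there is nothing left to fetch lazily.
            cursorObj.execute('INSERT INTO dl_ledger (name, title, dl_link) VALUES (?,?,?)',
                              (reddit_row[0], str(reddit_row[4]), dl_lnk[0]))
    db.commit()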