Trying to append values to list in a loop and it appends the new value to the end but overwrites eve

I am trying to build a per-page index of the words from a CSV against the text of a PDF. Reading and writing both work, except when the new page count is appended to page_list. Instead of just adding the new item to the end of the list, the code does that and also replaces every previous item with the newest one. The output contains all the words I expect from my initial CSV, and each list has the expected length, but the values inside are overwritten.

What is wrong with my code that it is creating that?

It seems to be an issue with .append() rather than the loop, but I can't figure out a workaround or solution.

Below is the code I am currently using for this function. The "for w in words" loop is where I believe the issue to lie.

data = pd.read_csv(index)
for da in data:
    words=data[da].values
    with open(pdf, 'rb') as f:
        readerobj = PdfFileReader(f)
        pages = readerobj.numPages
        page_dict={} # eventual index

        # read pdf page by page
        count=0

        for i in range(pages):
            page_list=[]
            if i in range1:
                continue
            else:
                count += 1
                page = readerobj.getPage(i)
                reader=page.extractText()
                pdfobj = nltk.word_tokenize(str(reader))

                for w in words:
                    # Overwriting every value in list
                    if str(w) in str(pdfobj):

                        page_list.append(count)
                        page_dict[w]=page_list
            with open('Index.csv', 'w', encoding="utf-8") as x:
                writer = csv.writer(x)
                for key, value in page_dict.items():
                    writer.writerow([key, value])
    f.close()

I am expecting the output CSV to read like: {word1: [1,4,8,23] word2:[1,3,5,7,22,33]}

But instead it outputs: {word1: [23,23,23,23] word2:[33,33,33,33,33,33]}

CodePudding user response:

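Each time through the for i in range(pages): loop you create a new, empty page_list, which you then assign to the dictionary entries. That discards the pages appended on previous iterations. Every word matched on the same page also ends up pointing at the same list object, which is why the same page number shows up repeated.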
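The behaviour can be reproduced without a PDF. The following is a minimal sketch, assuming a hypothetical list of strings (fake_pages) stands in for the extracted page text; build_index_buggy is an illustrative name, not part of the original code.

```python
def build_index_buggy(fake_pages, words):
    """Mimic the question's loop: one fresh list per page, shared by every word on it."""
    page_dict = {}
    count = 0
    for text in fake_pages:
        page_list = []  # a new, empty list for every page
        count += 1
        for w in words:
            if w in text:
                page_list.append(count)
                # Rebinding the key drops whatever list the word had before
                page_dict[w] = page_list
    return page_dict


# Hypothetical stand-in for three pages of extracted text
fake_pages = ["apple banana", "apple", "apple banana"]
result = build_index_buggy(fake_pages, ["apple", "banana"])

print(result)                               # {'apple': [3, 3], 'banana': [3, 3]}
print(result["apple"] is result["banana"])  # True: both keys share one list
```

Only the last page survives, and it appears once per word matched on that page, because both keys refer to the same list.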

You need to keep the existing list in the dictionary and append to it, not create a new list each time.

data = pd.read_csv(index)
for da in data:
    words=data[da].values
    with open(pdf, 'rb') as f:
        readerobj = PdfFileReader(f)
        pages = readerobj.numPages
        page_dict={} # eventual index

        # read pdf page by page
        count=0

        for i in range(range1, pages):
            count += 1
            page = readerobj.getPage(i)
            reader=page.extractText()
            pdfobj = nltk.word_tokenize(str(reader))

            for w in words:
                # Append to this word's own list, creating it on first use
                if str(w) in str(pdfobj):
                    page_dict.setdefault(w, []).append(count)

Instead of the condition that executes continue for the first range1 pages, you can just start your range there (this assumes range1 is an integer count of leading pages to skip).
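To check the setdefault approach in isolation, here is a small sketch that again assumes a hypothetical list of strings (fake_pages) in place of the PDF text. The build_index function and its first_page parameter are illustrative names, with first_page playing the role of an integer range1.

```python
def build_index(fake_pages, words, first_page=0):
    """Map each word to the running count of every processed page that contains it."""
    page_dict = {}
    count = 0
    # Start the range at first_page instead of skipping pages with continue
    for i in range(first_page, len(fake_pages)):
        count += 1
        text = fake_pages[i]
        for w in words:
            if w in text:
                # Reuse the word's existing list, or create it on first use
                page_dict.setdefault(w, []).append(count)
    return page_dict


# Hypothetical stand-in: one skipped cover page followed by three pages of text
fake_pages = ["cover", "apple banana", "apple", "apple banana"]
index = build_index(fake_pages, ["apple", "banana"], first_page=1)

print(index)  # {'apple': [1, 2, 3], 'banana': [1, 3]}
```

Each word now owns its own list, so the dictionary can be written to the CSV once, after the page loop has finished.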