During a loop in which I append indexes to a list, said list of indices changes from list to numpy a-CodePudding

I'll start with a little background on what I'm trying to accomplish with this product. I need to filter through a CSV file and search for certain keywords anywhere in the file. Which column it is in is not important to the project. The goal is to get the indices of rows that match the criteria. With the list, we will determine which rows are relevant to the search we are conducting and which are not relevant to the search.

The problem has to do with the list we are appending data to changing types during the loop. During the loop structure, I've written, the list that stores the information runs a few times and then changes from a list to a NumPy array. This then leads to an error and a breaking of the code. I've put in a few try and except statements to find where the error occurs.

Below you'll find the loop I use to try and find the relevant indexes. The type is tracked by the print statements throughout the loop. I've put comments to explain my rationale for some of the print statements I use in this bit of code. The Data referenced in the first loop line is the data frame of the CSV.

import pandas as pd
import numpy as np
data = pd.read_csv('./real_acct.txt', delimiter = "\t")
index_list = []
print(type(index_list))
#--------------------------------------
for col in data:
    df_loop = data[data[col].astype(str).str.contains("SOME VAL", na = False)] # find the key word in each column on the loop
    print(type(index_list)) # this is so I can see what the type of the index list is
    print('col loop') # this lets me know that I am in the column loop
    print('------------') 
    if(df_loop.shape[0] > 0):
        list_ind = list(df_loop.index.values)
        print('shape is greater than 1x1') # df_loop found a value that contains the text
        print(type(index_list)) # check the type after we find something we would like to append
        print('-----------') # break the text up
        if(len(list_ind) != 0):
            try:   
                for i in range(len(list_ind)):
                    print('loop i')
                    index_list.append(int(list_ind[i]))
            except AttributeError:
                print('the end is nigh') # I like Watchmen
                try:   
                    for i in range(len(list_ind)):
                        print('loop j')
                        print(type(index_list))
                        index_list.insert(int(list_ind[i]))
                except AttributeError:
                    print('the end') # break if error
                    break
print(index_list)

After I run this, I get the following output. (I apologize for the length of this but the df I am searching through has 1507524 rows and 71 columns.

<class 'list'>
col loop
------------
<class 'list'>
col loop
------------
<class 'list'>
col loop
------------
shape is greater than 1x1
<class 'list'>
-----------
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
loop i
<class 'numpy.ndarray'>
col loop
------------
<class 'numpy.ndarray'>
col loop
------------
<class 'numpy.ndarray'>
col loop
------------
<class 'numpy.ndarray'>
col loop
------------
<class 'numpy.ndarray'>
col loop
------------
<class 'numpy.ndarray'>
col loop
------------
<class 'numpy.ndarray'>
col loop
------------
<class 'numpy.ndarray'>
col loop
------------
<class 'numpy.ndarray'>
col loop
------------
<class 'numpy.ndarray'>
col loop
------------
<class 'numpy.ndarray'>
col loop
------------
shape is greater than 1x1
<class 'numpy.ndarray'>
-----------
loop i
the end is nigh
loop j
<class 'numpy.ndarray'>
dough
[  61708   99779  118192  118193  164900  164901  210027  210030  210031
  232259  255711  379029  379030  496978  497010  497011  578759  852625
  941357  941359  941363  941375 1029671 1136526 1212745 1315677 1328337
 1333935 1342118 1401022 1462777 1462778 1462779 1462781]

I've tried to fix this; however, I've had no luck so far. Any help that can assist me in wrapping my brain around this would be much appreciated.

CodePudding user response：

Figured it out after seeing your picture. pd.unique() returns an numpy.ndarray. Use list(set(index_list)) instead or move it out of your outermost for loop.

As AJ Biffle pointed out, you're using insert instead of append in your j loop which is causing an error because insert takes two arguments (the index to insert the object and the object). You should also try to avoid looping through dataframes.
I know this doesn't answer the question (why it's changing) but this should get the desired output. There's most likely a better way to do this but I'm not a dataframe expert.

def check_for_keyword(x):
    keywords = []  # Add keywords to this list
    for keyword in keywords:
        if keyword.lower() in str(x).lower():  # Test this to see if x can be converted to string every time
            return True
    return False

data = data.applymap(lambda x: check_for_keyword(x)).any(axis=1)
row_indices = list(data.iloc[list(data)].index)

Essentially applymap goes through every value in the dataframe and changes the value to whatever scalar value is returned from the passed in function. In this case it returns True if the keyword is found in the value and False otherwise. any with axis=1 goes row by row and checks if any of the column values are True, and returns a 1 column Series of booleans. Then those values are turned into a list and used with iloc to only grab the rows that had a value of True (rows that had a column that contained a keyword). It then grabs the indices of the rows and converts them to a list.

Another option (since column index does not matter) would be to use open

keywords = []
row_indices = []
with open('./real_acct.txt') as f:
    for index, line in enumerate(f):
        for keyword in keywords:
            if keyword.lower() in line.lower():
                row_indices.append(index)
                break  # row contains at least one keyword, no need to check for more