Find set of rows with most instances of certain features in a column


I have tables of the following form:

word      count  feature
my        0      pronoun
favorite  0      preferences
food      0      object
is        0      being
ice       0      dessert
cream     1      dessert

Each table is thousands of rows long. My goal is to find the top 3 sets of 100 rows in the table with the highest counts of a given set of features in column 3. For example, I want to be able to ask, "Which 3 sets of 100 consecutive rows have the most occurrences of both "dessert" and "object" in column 3?" The rows are not in pre-set chunks: a set could be rows 0-99 or 54-153. The output should be a set of row ranges (e.g., rows 4-103).

I'm completely lost on how to do this without writing some massive loop over every possible set of 100 rows and counting the values in each, which seems inefficient. I suspect there is a built-in function of some kind that does this, but I can't figure out what. Any thoughts?
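(For concreteness, the brute-force approach described above might look like the following minimal sketch, assuming the table is already in a pandas DataFrame named df with a feature column; all names here are illustrative.)

import pandas as pd

def brute_force_window_counts(df, features, window=100):
    # Count matching features in every possible run of `window` consecutive rows.
    counts = {}
    for start in range(len(df) - window + 1):
        block = df.iloc[start:start + window]            # rows start .. start+window-1
        counts[start] = block["feature"].isin(features).sum()
    return counts  # keyed by each window's starting row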

CodePudding user response:

Have you tried the groupby() function? In this case:

In [1]: df.groupby(["feature", "word"]).size()
Out[1]:
feature      word
being        is          1
dessert      cream       1
             ice         1
object       food        1
preferences  favorite    1
pronoun      my          1
dtype: int64
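If what you actually need is a count per feature (rather than per word/feature pair), grouping on the single column may be closer to the goal; a sketch with the same example df:

In [2]: df.groupby("feature").size()
Out[2]:
feature
being          1
dessert        2
object         1
preferences    1
pronoun        1
dtype: int64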

CodePudding user response:

First off, use the pandas library. It contains vectorized implementations of many operations you'd otherwise write yourself, so it will be much faster than looping over thousands of rows in plain Python.

Start by reading the csv file into a pandas dataframe:

import pandas as pd

# If the file is whitespace-separated rather than comma-separated,
# you may need pd.read_csv('csv_file.csv', sep=r'\s+') instead.
df = pd.read_csv('csv_file.csv')

With your given example, this yields a dataframe that looks like so:

       word  count      feature
0        my      0      pronoun
1  favorite      0  preferences
2      food      0       object
3        is      0        being
4       ice      0      dessert
5     cream      1      dessert

Now, define a function that takes a row and counts the occurrences of a keyword in the 100 rows starting at that row:

def count_in_next_100(row, keyword):
    row_index = row.name  # Since the index is numeric, row.name is the row number
    # Take the feature column for the 100 rows starting at this row.
    # .loc slicing is inclusive on both ends, so row_index + 99 covers 100 rows.
    # Comparing with == keyword gives a Series of True/False,
    # and .sum() counts how many are True.
    total = (df.loc[row_index:row_index + 99, "feature"] == keyword).sum()
    return total
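Since the question asks about a set of features (e.g. both "dessert" and "object"), the same idea works with .isin; a minimal variant of the function above (the name and signature are just illustrative):

def count_set_in_next_100(row, keywords):
    row_index = row.name
    # .isin checks membership in the whole set of keywords at once
    return df.loc[row_index:row_index + 99, "feature"].isin(keywords).sum()

count_both = df.apply(count_set_in_next_100, axis=1, args=(["dessert", "object"],))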

Next, apply this function to your dataframe row by row, i.e. with axis=1:

count_dessert = df.apply(count_in_next_100, axis=1, args=("dessert",))

Then, count_dessert.idxmax() will give you the row number at which the 100-row window with the most occurrences of "dessert" starts. I'm going to leave the "find top 3" part as an exercise for you, but let me know if you need help with it.
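As an aside, the same per-window counts can be computed without apply by using pandas' rolling windows; a sketch with the same df and feature set as above (variable names are illustrative):

# True where the feature is in the set, then a sliding sum over 100 rows
window_counts = df["feature"].isin(["dessert", "object"]).rolling(window=100).sum()
# window_counts[i] is the count for the window ending at row i (rows i-99 .. i);
# the first 99 entries are NaN because the window isn't full yet.
best_window_end = window_counts.idxmax()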
