
Time:07-28

How to extract the 10 biggest peak values in each row of a pandas dataframe (and the column name each value resides in)?

I have an idea for how this might work in theory. Unfortunately, after trying to implement it in various ways (iteritems, iterrows, itertuples, loc, iloc, combinations of these, and counting the rows and columns so I could loop with for x in range(...)), I am still struggling with how to iterate through each value of each row and extract both the value I want and the name of the column that value resides in.

Example dataframe

frequency_bin_1  frequency_bin_2  frequency_bin_3  frequency_bin_4  frequency_bin_5
0.2222222        0.33333333       0.44444444       0.22222222       0.11111111
0.11111111       0.2222222        0.33333333       0.44444444       0.33333333
0.11111111       0.2222222        0.11111111       0.22222222       0.11111111
0.11111111       0.2222222        0.11111111       0.22222222       0.33333333
0.11111111       0.2222222        0.11111111       0.44444444       0.33333333

So for the above dataframe I would want to extract both the value and the column name for:

  • row 1 peak/s: frequency_bin_3
  • row 2 peak/s: frequency_bin_4
  • row 3 peak/s: frequency_bin_2, frequency_bin_4
  • row 4 peak/s: frequency_bin_2, frequency_bin_5
  • row 5 peak/s: frequency_bin_2, frequency_bin_4

I do have an idea of how this code might flow...

Idea for code flow

"""
    iterate through each row of dataframe
        iterate through each value in the row
            set first value as 'current'
            if next value is greater than 'current'
                just replace 'current' value with next value
            if next value is smaller than 'current'
                append value of 'current' to list peak_values
                append column name of 'current' to list of peak_columns
                    (side note these two lists could be a different data structure, 
                    whatever is easiest for the next steps, possibly a dictionary?)
            when finished iterating through a row
                keep only 10 biggest peak values and their corresponding column name
                append lists peak_values and peak_columns to end of dataframe row
                    (alternating, column name, column value, column name, column value, etc.)
"""

I am now at a point where I have become confused and overwhelmed. Perhaps I need to do the iteration in a different data structure? Maybe extract each row into a list, iterate through that list to pull out the values I want, then append those values back onto the dataframe?

The reason for using a dataframe is to generate a CSV file of the data for further analysis.

It should be noted that I am very new to programming and trying to find my feet so any suggestions will really be appreciated. Thanks!

CodePudding user response:

I would try something like this:

largest_row_1 = df.nlargest(10, ['frequency_bin_3'])

CodePudding user response:

You could try the following (with df being your dataframe):

peaks = (df.shift(-1, axis=1) < df) & (df.shift(1, axis=1) < df)
peaks.iloc[:, 0] = df.iloc[:, 0] > df.iloc[:, 1]
peaks.iloc[:, -1] = df.iloc[:, -1] > df.iloc[:, -2]
res = peaks.apply(lambda row: list(row[row].index), axis=1)
  • Finding the peaks in the non-edge columns of df (i.e., not in first and last column) via comparison with the shift()-ed columns.
  • Identifying the peaks in the edge columns of df (first and last).
  • Extracting the corresponding column names.

Result (res):

0                     [frequency_bin_3]
1                     [frequency_bin_4]
2    [frequency_bin_2, frequency_bin_4]
3    [frequency_bin_2, frequency_bin_5]
4    [frequency_bin_2, frequency_bin_4]
dtype: object
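For reference, the snippet can be reproduced with the sample data from the question (the df construction here is just a transcription of the example table, keeping its digit counts):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "frequency_bin_1": [0.2222222, 0.11111111, 0.11111111, 0.11111111, 0.11111111],
    "frequency_bin_2": [0.33333333, 0.2222222, 0.2222222, 0.2222222, 0.2222222],
    "frequency_bin_3": [0.44444444, 0.33333333, 0.11111111, 0.11111111, 0.11111111],
    "frequency_bin_4": [0.22222222, 0.44444444, 0.22222222, 0.22222222, 0.44444444],
    "frequency_bin_5": [0.11111111, 0.33333333, 0.11111111, 0.33333333, 0.33333333],
})

# A cell is a peak if it is strictly greater than both horizontal neighbours;
# the two edge columns only have one neighbour each, handled separately.
peaks = (df.shift(-1, axis=1) < df) & (df.shift(1, axis=1) < df)
peaks.iloc[:, 0] = df.iloc[:, 0] > df.iloc[:, 1]
peaks.iloc[:, -1] = df.iloc[:, -1] > df.iloc[:, -2]

# For each row, keep the names of the columns where the mask is True.
res = peaks.apply(lambda row: list(row[row].index), axis=1)
print(res)
```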

With peaks as above you could do:

def get_peaks(row):
    peaks = row[row.notna()]
    peak_pairs = sorted(zip(peaks.index, peaks), key=lambda t: t[1], reverse=True)
    return [item for pair in peak_pairs[:10] for item in pair]

df["top_peaks"] = df[peaks].apply(get_peaks, axis=1)
  • df[peaks]: setting all but the peaks to NaN.
  • In get_peaks:
    • Extract the peaks from the row (the non-NaN values).
  • Sort the corresponding column-name/value pairs by value in descending order, i.e. highest first.
    • Take at most 10 of the pairs and put them flattened into a list.
  • Apply get_peaks and put the result in a new df-column top_peaks.

Result:

   frequency_bin_1  ...                                                  top_peaks
0         0.222222  ...                              [frequency_bin_3, 0.44444444]
1         0.111111  ...                              [frequency_bin_4, 0.44444444]
2         0.111111  ...  [frequency_bin_4, 0.22222222, frequency_bin_2, 0.2222222]
3         0.111111  ...  [frequency_bin_5, 0.33333333, frequency_bin_2, 0.2222222]
4         0.111111  ...  [frequency_bin_4, 0.44444444, frequency_bin_2, 0.2222222]

The new column top_peaks should contain pretty much what you are looking for (if I understand correctly):

... keep only 10 biggest peak values and their corresponding column name, append lists peak_values and peak_columns to end of dataframe row (alternating, column name, column value, column name, column value, etc.)
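Since the stated goal is a CSV file for further analysis, the two snippets combine end-to-end like this (the filename peaks.csv is just an example):

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "frequency_bin_1": [0.2222222, 0.11111111, 0.11111111, 0.11111111, 0.11111111],
    "frequency_bin_2": [0.33333333, 0.2222222, 0.2222222, 0.2222222, 0.2222222],
    "frequency_bin_3": [0.44444444, 0.33333333, 0.11111111, 0.11111111, 0.11111111],
    "frequency_bin_4": [0.22222222, 0.44444444, 0.22222222, 0.22222222, 0.44444444],
    "frequency_bin_5": [0.11111111, 0.33333333, 0.11111111, 0.33333333, 0.33333333],
})

# Boolean mask of local peaks; edge columns compared to their one neighbour.
peaks = (df.shift(-1, axis=1) < df) & (df.shift(1, axis=1) < df)
peaks.iloc[:, 0] = df.iloc[:, 0] > df.iloc[:, 1]
peaks.iloc[:, -1] = df.iloc[:, -1] > df.iloc[:, -2]

def get_peaks(row):
    found = row[row.notna()]  # only the peak cells survive the mask below
    pairs = sorted(zip(found.index, found), key=lambda t: t[1], reverse=True)
    return [item for pair in pairs[:10] for item in pair]

# df[peaks] keeps peak cells and sets everything else to NaN.
df["top_peaks"] = df[peaks].apply(get_peaks, axis=1)
df.to_csv("peaks.csv", index=False)
```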
