How to extract the 10 biggest peak values in each row of a pandas dataframe (and the column name each value resides in)?
I have an idea for how this might work in theory. Unfortunately, after trying to implement it in various ways (`iteritems`, `iterrows`, `itertuples`, `loc`, `iloc`, combinations of these, and getting the number of rows and columns and then using `for x in range(columns or rows)`),
I am still struggling with how to iterate through each value of each row and extract both the value I want and the name of the column that value resides in.
Example dataframe
| frequency_bin_1 | frequency_bin_2 | frequency_bin_3 | frequency_bin_4 | frequency_bin_5 |
| --- | --- | --- | --- | --- |
| 0.22222222 | 0.33333333 | 0.44444444 | 0.22222222 | 0.11111111 |
| 0.11111111 | 0.22222222 | 0.33333333 | 0.44444444 | 0.33333333 |
| 0.11111111 | 0.22222222 | 0.11111111 | 0.22222222 | 0.11111111 |
| 0.11111111 | 0.22222222 | 0.11111111 | 0.22222222 | 0.33333333 |
| 0.11111111 | 0.22222222 | 0.11111111 | 0.44444444 | 0.33333333 |
So for the above dataframe I would want to extract both the value and the column name for:
- row 1 peak/s: frequency_bin_3
- row 2 peak/s: frequency_bin_4
- row 3 peak/s: frequency_bin_2, frequency_bin_4
- row 4 peak/s: frequency_bin_2, frequency_bin_5
- row 5 peak/s: frequency_bin_2, frequency_bin_4
I do have an idea of how this code might flow...
Idea for code flow
"""
iterate through each row of dataframe
iterate through each value in the row
set first value as 'current'
if next value is greater than 'current'
just replace 'current' value with next value
if next value is smaller than 'current'
append value of 'current' to list peak_values
append column name of 'current' to list of peak_columns
(side note these two lists could be a different data structure,
whatever is easiest for the next steps, possibly a dictionary?)
when finished iterating through a row
keep only 10 biggest peak values and their corresponding column name
append lists peak_values and peak_columns to end of dataframe row
(alternating, column name, column value, column name, column value, etc.)
"""
I am now at a point where I have become confused and overwhelmed. Perhaps I need to complete the iteration in a different data structure? Maybe extract each row into a list, iterate through that list to extract the values I want, then append those values back onto the dataframe?
The reason for using a dataframe is to generate a CSV file of the data for further analysis.
It should be noted that I am very new to programming and trying to find my feet so any suggestions will really be appreciated. Thanks!
CodePudding user response:
I would try something like this:
```python
largest_row_1 = df.nlargest(10, ['frequency_bin_3'])
```
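Note that `nlargest` selects whole rows ranked by the values in one named column, so this returns the (up to) 10 rows with the largest `frequency_bin_3` values rather than the per-row peaks the question asks for. A minimal illustration on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "frequency_bin_3": [0.44444444, 0.33333333, 0.11111111],
    "frequency_bin_4": [0.22222222, 0.44444444, 0.22222222],
})

# Whole rows, sorted by frequency_bin_3 descending (at most 10 of them).
largest = df.nlargest(10, ["frequency_bin_3"])
print(largest.index.tolist())  # → [0, 1, 2]
```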
CodePudding user response:
You could try the following (`df` being your dataframe):

```python
peaks = (df.shift(-1, axis=1) < df) & (df.shift(1, axis=1) < df)
peaks.iloc[:, 0] = df.iloc[:, 0] > df.iloc[:, 1]
peaks.iloc[:, -1] = df.iloc[:, -1] > df.iloc[:, -2]
res = peaks.apply(lambda row: list(row[row].index), axis=1)
```
- Finding the peaks in the non-edge columns of `df` (i.e., not in the first and last column) via comparison with the `shift()`-ed columns.
- Identifying the peaks in the edge columns of `df` (first and last).
- Extracting the corresponding column names.
Result (`res`):

```
0                      [frequency_bin_3]
1                      [frequency_bin_4]
2    [frequency_bin_2, frequency_bin_4]
3    [frequency_bin_2, frequency_bin_5]
4    [frequency_bin_2, frequency_bin_4]
dtype: object
```
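For a copy-paste run, here are the same steps with the example dataframe rebuilt from the question (only the `DataFrame` construction is added; the peak-finding lines are unchanged from above):

```python
import pandas as pd

df = pd.DataFrame({
    "frequency_bin_1": [0.22222222, 0.11111111, 0.11111111, 0.11111111, 0.11111111],
    "frequency_bin_2": [0.33333333, 0.22222222, 0.22222222, 0.22222222, 0.22222222],
    "frequency_bin_3": [0.44444444, 0.33333333, 0.11111111, 0.11111111, 0.11111111],
    "frequency_bin_4": [0.22222222, 0.44444444, 0.22222222, 0.22222222, 0.44444444],
    "frequency_bin_5": [0.11111111, 0.33333333, 0.11111111, 0.33333333, 0.33333333],
})

# A cell is a peak if it is greater than both horizontal neighbours;
# shift(-1/1, axis=1) lines each cell up against its right/left neighbour.
peaks = (df.shift(-1, axis=1) < df) & (df.shift(1, axis=1) < df)
# The edge columns only have one neighbour each, so fix them up separately.
peaks.iloc[:, 0] = df.iloc[:, 0] > df.iloc[:, 1]
peaks.iloc[:, -1] = df.iloc[:, -1] > df.iloc[:, -2]
# Per row, collect the names of the columns where the boolean mask is True.
res = peaks.apply(lambda row: list(row[row].index), axis=1)
print(res)
```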
With `peaks` as above you could do:

```python
def get_peaks(row):
    peaks = row[row.notna()]
    peak_pairs = sorted(zip(peaks.index, peaks), key=lambda t: t[1], reverse=True)
    return [item for pair in peak_pairs[:10] for item in pair]

df["top_peaks"] = df[peaks].apply(get_peaks, axis=1)
```
- `df[peaks]`: setting all but the peaks to `NaN`.
- In `get_peaks`:
  - Extract the peaks from the row (the non-`NaN` values).
  - Sort the corresponding column-name/value pairs by value, reversed, i.e. the highest first.
  - Take at most 10 of the pairs and put them, flattened, into a list.
- Apply `get_peaks` and put the result in a new `df` column `top_peaks`.
Result:

```
  frequency_bin_1  ...                                                  top_peaks
0        0.222222  ...                              [frequency_bin_3, 0.44444444]
1        0.111111  ...                              [frequency_bin_4, 0.44444444]
2        0.111111  ...  [frequency_bin_4, 0.22222222, frequency_bin_2, 0.22222222]
3        0.111111  ...  [frequency_bin_5, 0.33333333, frequency_bin_2, 0.22222222]
4        0.111111  ...  [frequency_bin_4, 0.44444444, frequency_bin_2, 0.22222222]
```
The new column `top_peaks` should contain pretty much what you are looking for (if I understand correctly):

> ... keep only 10 biggest peak values and their corresponding column name, append lists peak_values and peak_columns to end of dataframe row (alternating, column name, column value, column name, column value, etc.)
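Since the stated goal is a CSV for further analysis, the frame with the new column can then be written out with `to_csv`. One caveat worth knowing: a list-valued column like `top_peaks` is serialized as the list's string representation in a single (quoted) CSV cell. A small sketch (the in-memory buffer just stands in for a file; `peaks.csv` is an example filename):

```python
import io

import pandas as pd

# A tiny frame shaped like the result above.
df = pd.DataFrame({
    "frequency_bin_1": [0.22222222, 0.11111111],
    "top_peaks": [["frequency_bin_3", 0.44444444], ["frequency_bin_4", 0.44444444]],
})

buf = io.StringIO()          # in practice: df.to_csv("peaks.csv", index=False)
df.to_csv(buf, index=False)  # list cells are written as quoted strings
print(buf.getvalue())
```

If the downstream analysis needs each name and value in its own column, it may be easier to expand the list into separate columns before writing the CSV.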