How to find 3 similar numbers in a column of a data frame in python pandas

Time:03-26

numbers = ['1.46', '1.59', '1.43', '1.42', '1.45', '1.65', '1.35', '1.39', '1.55', '1.88', '1.43']

All I want is a list of the 3 numbers that are closest to each other.

In this case the numbers would be 1.43, 1.42 and 1.43.

I can't find any help anywhere. Some people pass in a reference value, for example

test = nsmallest(3, price, key=lambda x: abs(x - 1.42))

but I don't want to supply a reference value.

CodePudding user response:

I'd get all fancy with this using pandas:

import pandas as pd

s = pd.Series([float(i) for i in numbers])
ss = s.sort_values()
# spread (last - first) of every window of 3 consecutive sorted values
idx = ss.rolling(3).apply(lambda x: abs(x.iloc[0]-x.iloc[2])).idxmin()
i = ss.index.get_loc(idx)
ss.iloc[i-2:i+1].to_numpy()

Output:

array([1.42, 1.43, 1.43])

CodePudding user response:

Pandas is not necessarily the best tool for this, but it is possible:

numbers = pd.Series(numbers)
numbers = numbers.astype('float')
numbers = numbers.sort_values()
numbers = numbers.reset_index(drop=True)
smallest_index = numbers.diff(2).idxmin()
numbers.loc[smallest_index-2:smallest_index].values
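
For the sample list this should give 1.42, 1.43, 1.43. As a sanity check, here is a minimal pure-Python sketch that compares a brute-force search over every triple with the "sort, then look only at consecutive windows" idea. It assumes "closest" means the smallest max - min spread. The function names are made up for illustration and do not come from any library.

```python
from itertools import combinations

numbers = ['1.46', '1.59', '1.43', '1.42', '1.45', '1.65', '1.35', '1.39', '1.55', '1.88', '1.43']
values = [float(n) for n in numbers]


def closest_triple_bruteforce(vals):
    # Try every combination of three values; because the input is sorted,
    # each tuple is ascending and its spread is simply t[2] - t[0].
    return min(combinations(sorted(vals), 3), key=lambda t: t[2] - t[0])


def closest_triple_sorted(vals):
    # After sorting, only windows of three consecutive values need checking.
    s = sorted(vals)
    windows = (tuple(s[i:i + 3]) for i in range(len(s) - 2))
    return min(windows, key=lambda t: t[2] - t[0])


print(closest_triple_bruteforce(values))
print(closest_triple_sorted(values))
```

Both functions should agree on this data. The brute-force version checks every triple, so it is only practical for short lists, while the sorted version costs one sort plus a single pass.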

CodePudding user response:

This works for me

def mostSimilar(numbers):
    # sort numerically; sorting the raw strings would compare them as text
    sorted_array = sorted(numbers, key=float)
    diff = float('inf')
    most_similar_values = None
    # a list of n values has n-2 windows of three consecutive values
    for i in range(len(sorted_array)-2):
        tmpDiff = float(sorted_array[i+1])-float(sorted_array[i]) + float(sorted_array[i+2])-float(sorted_array[i+1])
        if tmpDiff < diff:
            diff = tmpDiff
            most_similar_values = (sorted_array[i], sorted_array[i+1], sorted_array[i+2])
    return most_similar_values

CodePudding user response:

You can use the snippet below. Further down there is an alternative solution (same logic, just pure pandas method chaining fun) and an explanation of the logic.

window_size = 3
sorted_numbers = pd.Series(numbers).astype(float).sort_values()
mingroup_right = sorted_numbers.diff(window_size-1).argmin() + 1
out = sorted_numbers.iloc[mingroup_right-window_size : mingroup_right]

print(out)
3     1.42
2     1.43
10    1.43
dtype: float64

Alternatively, this one is for the pandas method chaining addicts out there:

window_size = 3
out = (
    pd.Series(numbers)
    .astype(float)
    .sort_values()
    .iloc[lambda s: 
        slice( 
            min_pos := s.diff(window_size-1).argmin() - window_size + 1,
            min_pos + window_size
        )
    ]
)

print(out)
3     1.42
2     1.43
10    1.43
dtype: float64

The logic:

  • window_size → the number of adjacent floats we want to compare at one time
  • coerce all values to floats
  • sort them so that values close to each other end up adjacent
  • diff(window_size-1) subtracts the first value from the last value in each group of size window_size.
    • The minimum of this output marks the group whose values are all nearest each other
  • use argmin to get the position of the minimum diff value (the last element of that group), then offset it by window_size to get the start and end positions of the group
  • .iloc pairs with our argmin()-based slice to extract the group from the sorted Series
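
The steps above can be sketched one at a time so the intermediate values are visible. This is only an illustration using the sample list from the question. The names spread, first and last are introduced here, and resetting the index is an extra step (not in the snippets above) so that labels match positions.

```python
import pandas as pd

numbers = ['1.46', '1.59', '1.43', '1.42', '1.45', '1.65', '1.35', '1.39', '1.55', '1.88', '1.43']
window_size = 3

# coerce to float and sort; reset the index so labels equal positions
s = pd.Series(numbers).astype(float).sort_values().reset_index(drop=True)

# last value minus first value of every window of window_size sorted values;
# the first window_size-1 entries are NaN because no full window ends there
spread = s.diff(window_size - 1)

# argmin skips NaN and returns the position of the LAST element of the tightest window
last = int(spread.argmin())
first = last - window_size + 1

closest = s.iloc[first:last + 1].tolist()
print(spread.round(2).tolist())
print(closest)
```

Changing window_size to another value (for example 4) should pick the tightest group of that size without any other change.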