Home > Software engineering >  Compare any given two values in a top X from a DataFrame column by their index (position) in pandas
Compare any given two values in a top X from a DataFrame column by their index (position) in pandas

Time:05-11

I need to know if e.g. first value compared to third value is bigger or not in a given column in pandas DataFrame using Python.

I did this already in simple Python language and I am gonna post the full code below in order to illustrate better what I want to do:

# given array or list: (supposing this is a pandas DataFrame column)
mylst=[23, 18, 15, 14, 19, 28, 37, 29, 99]

print(mylst)

# get top 5 values:
print(sorted(mylst, reverse=True))
print(sorted(mylst, reverse=True)[:5])
top5 = sorted(mylst, reverse=True)[:5]

# all top 5 values in the initial order as they are found in mylst:
mytops = [x for x in mylst if x in top5]

print(mytops)

mytops contains ONLY top 5 values from mylst and they are not sorted in any order so they preserve their initial order/indexes so printing mytops will output: [23, 28, 37, 29, 99] as they are found in the original mylst so you know that 23 is the first value, 28 is the second, 37 and the third value and so one to the end of mytops list.

Now, as I've said in the first initially

I need to know if e.g. first value compared to third value is bigger or not

In native Python just need for this example to compare 23 to 37.

if mytops[0]>mytops[2]:
   # do something...
else:
   # do something else...

In pandas this will be something like:

df['new column that will contain the comparison results'] = np.where(condition,'value if true','value if false')

which comes into this:

df['first_vs_third'] = np.where(mytops[0]>mytops[2],1,0) #supposing that mytops[0]>mytops[2] works.

My questions are:

Supposing I have a huge DataFrame (and mylst from my code example is a pandas column with unique values), how to do this in pandas in a very fast and efficient way?

How to parameterize mytops[0]>mytops[2] and all that native Python code?

How to get top X values from column without the need to use native Python code as in example above?

What if mylst values are not unique?

How the code will be in this case?

Thank you in advance!

CodePudding user response:

The pandas equivalent of you code would be:

mylst = [23, 18, 15, 14, 19, 28, 37, 29, 99]
s = pd.Series(mylst)

s2 = s[s.isin(s.nlargest(5))]

if s.iloc[2]>s.iloc[0]:
    print('do something')
else:
    print('do something else')

For a vectorial comparison of indices 2 vs 0, then 3 vs 1, etc. you can use:

s2.diff(2)[2:].lt(0).to_list()
# [False, False, False]

To compare all combinations:

a = s2.to_numpy()
a>a[:,None]

array([[False,  True,  True,  True,  True],
       [False, False,  True,  True,  True],
       [False, False, False, False,  True],
       [False, False,  True, False,  True],
       [False, False, False, False, False]])

# or
pd.DataFrame(a>a[:,None], index=s2, columns=s2)

       23     28     37     29     99
23  False   True   True   True   True
28  False  False   True   True   True
37  False  False  False  False   True
29  False  False   True  False   True
99  False  False  False  False  False
  • Related