I need to know if e.g. first value compared to third value is bigger or not in a given column in pandas DataFrame using Python.
I did this already in simple Python language and I am gonna post the full code below in order to illustrate better what I want to do:
# given array or list: (supposing this is a pandas DataFrame column)
mylst=[23, 18, 15, 14, 19, 28, 37, 29, 99]
print(mylst)
# get top 5 values:
print(sorted(mylst, reverse=True))
print(sorted(mylst, reverse=True)[:5])
top5 = sorted(mylst, reverse=True)[:5]
# all top 5 values in the initial order as they are found in mylst:
mytops = [x for x in mylst if x in top5]
print(mytops)
mytops contains ONLY top 5 values from mylst and they are not sorted in any order so they preserve their initial order/indexes so printing mytops will output: [23, 28, 37, 29, 99] as they are found in the original mylst so you know that 23 is the first value, 28 is the second, 37 and the third value and so one to the end of mytops list.
Now, as I've said in the first initially
I need to know if e.g. first value compared to third value is bigger or not
In native Python just need for this example to compare 23 to 37.
if mytops[0]>mytops[2]:
# do something...
else:
# do something else...
In pandas this will be something like:
df['new column that will contain the comparison results'] = np.where(condition,'value if true','value if false')
which comes into this:
df['first_vs_third'] = np.where(mytops[0]>mytops[2],1,0) #supposing that mytops[0]>mytops[2] works.
My questions are:
Supposing I have a huge DataFrame (and mylst from my code example is a pandas column with unique values), how to do this in pandas in a very fast and efficient way?
How to parameterize mytops[0]>mytops[2] and all that native Python code?
How to get top X values from column without the need to use native Python code as in example above?
What if mylst values are not unique?
How the code will be in this case?
Thank you in advance!
CodePudding user response:
The pandas equivalent of you code would be:
mylst = [23, 18, 15, 14, 19, 28, 37, 29, 99]
s = pd.Series(mylst)
s2 = s[s.isin(s.nlargest(5))]
if s.iloc[2]>s.iloc[0]:
print('do something')
else:
print('do something else')
For a vectorial comparison of indices 2 vs 0, then 3 vs 1, etc. you can use:
s2.diff(2)[2:].lt(0).to_list()
# [False, False, False]
To compare all combinations:
a = s2.to_numpy()
a>a[:,None]
array([[False, True, True, True, True],
[False, False, True, True, True],
[False, False, False, False, True],
[False, False, True, False, True],
[False, False, False, False, False]])
# or
pd.DataFrame(a>a[:,None], index=s2, columns=s2)
23 28 37 29 99
23 False True True True True
28 False False True True True
37 False False False False True
29 False False True False True
99 False False False False False