I have a very large dataframe with meterId retailPrice. I want to output only the meterIds when there are both positive and 0 values. My current query is very performance heavy and I could not check the correctness yet.
Is there a more performant way?
d = {'meterId': ["x", "x", "y", "y", "z", "z"], 'retailPrice': [1, 0, 0, 0, 1, 1]}
df = pd.DataFrame(data=d)
df
tmp = pd.DataFrame(df["meterId"])
for x in tmp["meterId"]:`
zero_values = pd.DataFrame(df.loc[(df['meterId'] == x) & df['retailPrice'] == 0)])
positive_values = pd.DataFrame(df.loc[(df['meterId'] == x) & (df['retailPrice'] > 0)])
if not zero_values.empty and not positive_values.empty:
print("meterID: " str(x))
My output should be like this:
"meterID: x"
CodePudding user response:
This also works:
for id_ in df.meterId.unique():
if (((df.retailPrice > 0) & (df.meterId == id_)).any()) & (
((df.retailPrice == 0) & (df.meterId == id_)).any()
):
print(f"meterID: {id_}")
But Mustafa Aydın's is more elegant.
CodePudding user response:
In [113]: pos = df.retailPrice > 0
In [114]: zer = df.retailPrice == 0
In [115]: pos.groupby(df.meterId).any() & zer.groupby(df.meterId).any()
Out[115]:
meterId
x True
y False
z False
Name: retailPrice, dtype: bool
In [116]: _[_].index.tolist()
Out[116]: ['x']
- get whether
pos
itive or not mask - get whether
zer
o or not mask - group these mask by ID and check if
any
value satisfies them - since need both pos and zero,
&
them; gives a True/False series for IDs as the desired result - index it with itself to retain only where True and get the index
_
is the last evaled thing, so i refer to the previous output with it. you can do s = ...
in 115 and do s[s].index.tolist()
CodePudding user response:
you can try:
#df
meterId retailPrice
0 x 1
1 x 0
2 y 0
3 y 0
4 z 1
5 z 1
zero_values_condition = df['meterId'].eq('x') & df['retailPrice'].eq(0) #eq() means ==
positive_values_condition = df['meterId'].eq('x') & df['retailPrice'].gt(0) #gt() means >
df['new_col'] = np.where((zero_values_condition|positive_values_condition), "meterID: x", '')
df
meterId retailPrice new_col
0 x 1 meterID: x
1 x 0 meterID: x
2 y 0
3 y 0
4 z 1
5 z 1
CodePudding user response:
here is another way , using group by
, let us know which approach was more performant :
res = df.groupby('meterId')['retailPrice'].apply(lambda x: (x > 0).sum() > 0 and (x == 0).sum() > 0)
out = res[res].index.tolist()