DataFrame: query column if it has negative and positve values-CodePudding

I have a very large dataframe with meterId retailPrice. I want to output only the meterIds when there are both positive and 0 values. My current query is very performance heavy and I could not check the correctness yet.

Is there a more performant way?

d = {'meterId': ["x", "x", "y", "y", "z", "z"], 'retailPrice': [1, 0, 0, 0, 1, 1]}
df = pd.DataFrame(data=d)
df

tmp = pd.DataFrame(df["meterId"])
for x in tmp["meterId"]:`
    zero_values = pd.DataFrame(df.loc[(df['meterId'] == x) & df['retailPrice'] == 0)])
    positive_values = pd.DataFrame(df.loc[(df['meterId'] == x) & (df['retailPrice'] > 0)])
    if not zero_values.empty and not positive_values.empty:
        print("meterID: "   str(x))

My output should be like this: "meterID: x"

CodePudding user response：

This also works:

for id_ in df.meterId.unique():
    if (((df.retailPrice > 0) & (df.meterId == id_)).any()) & (
        ((df.retailPrice == 0) & (df.meterId == id_)).any()
    ):
        print(f"meterID: {id_}")

But Mustafa Aydın's is more elegant.

CodePudding user response：

In [113]: pos = df.retailPrice >  0

In [114]: zer = df.retailPrice == 0

In [115]: pos.groupby(df.meterId).any() & zer.groupby(df.meterId).any()
Out[115]:
meterId
x     True
y    False
z    False
Name: retailPrice, dtype: bool

In [116]: _[_].index.tolist()
Out[116]: ['x']

get whether positive or not mask
get whether zero or not mask
group these mask by ID and check if any value satisfies them
since need both pos and zero, & them; gives a True/False series for IDs as the desired result
index it with itself to retain only where True and get the index

_ is the last evaled thing, so i refer to the previous output with it. you can do s = ... in 115 and do s[s].index.tolist()

CodePudding user response：

you can try:

#df
    meterId retailPrice
0   x       1
1   x       0
2   y       0
3   y       0
4   z       1
5   z       1

zero_values_condition      = df['meterId'].eq('x') & df['retailPrice'].eq(0) #eq() means ==
positive_values_condition  = df['meterId'].eq('x') & df['retailPrice'].gt(0) #gt() means >
df['new_col'] = np.where((zero_values_condition|positive_values_condition), "meterID: x", '')

df
    meterId retailPrice new_col
0   x       1           meterID: x
1   x       0           meterID: x
2   y       0   
3   y       0   
4   z       1   
5   z       1

CodePudding user response：

here is another way , using group by , let us know which approach was more performant :

res = df.groupby('meterId')['retailPrice'].apply(lambda x: (x > 0).sum() > 0 and (x == 0).sum() > 0)
out = res[res].index.tolist()