I want to count the number of elements in a row that fall within a range and then print a new column with the results. After looking around I came up with the following solution, however the results are not consistent. Is the solution too simplistic for what I want to accomplish? I have the following DataFrame.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(10, 3)), columns=list('ABC'))
def val_1(row):
val_1 = 0
if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(1,3):
val_1 = 3
elif row.loc['A'] and row.loc['B'] in range(1,3):
val_1 = 2
elif row.loc['A'] in range(1,3):
val_1 = 1
return val_1
def val_2(row):
val_2 = 0
if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(3,6) :
val_2 = 3
elif row.loc['A'] and row.loc['B'] in range(3,6) :
val_2 = 2
elif row.loc['A'] in range(3,6) :
val_2 = 1
return val_2
def val_3(row):
val_3 = 0
if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(6,10) :
val_3 = 3
elif row.loc['A'] and row.loc['B'] in range(6,10) :
val_3 = 2
elif row.loc['A']in range(6,10) :
val_3 = 1
return val_3
def results():
df['Val_1'] = df.apply(val_1, axis=1)
df['Val_2'] = df.apply(val_2, axis=1)
df['Val_3'] = df.apply(val_3, axis=1)
print(df)
results()
A B C Val_1 Val_2 Val_3
0 9 0 0 0 0 1
1 6 1 0 2 0 1
2 8 5 5 0 3 1
3 9 7 0 0 0 2
4 4 6 2 3 1 2
5 1 5 5 1 3 0
6 8 1 7 2 0 3
7 4 8 5 0 3 2
8 0 6 0 0 0 0
9 3 0 3 0 1 0
Thanks for your help.
CodePudding user response:
This is shorter version of your code. The problem is in your code:
- It is checking the range only for last element
row.loc['A'] and row.loc['B'] and row.loc["C"] in range(start,end)
- Only checking range logic for
C
column (Not forA
&B
)
def get_val(row,start,end):
val = 0
if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(start,end) :
val = 3
elif row.loc['A'] and row.loc['B'] in range(start,end) :
val = 2
elif row.loc['A'] in range(start,end) :
val = 1
return val
df = pd.DataFrame(np.random.randint(0,10,size=(10, 3)), columns=list('ABC'))
df['Val_1'] = df.apply(lambda x:get_val(x,1,3), axis=1)
df['Val_2'] = df.apply(lambda x:get_val(x,3,6), axis=1)
df['Val_3'] = df.apply(lambda x:get_val(x,6,10), axis=1)
My Solution
def query_build(start, end):
query1 = f'A>={start} and A<={end} and B>={start} and B<={end} and C>={start} and C<={end}'
query2 = f'A>={start} and A<={end} and B>={start} and B<={end}'
query3 = f'A>={start} and A<={end}'
return {query1: 3,
query2: 2,
query3: 1}
df = pd.DataFrame(np.random.randint(0,10,size=(10, 3)), columns=list('ABC'))
df['val1'] = 0
df['val2'] = 0
df['val3'] = 0
val_range = {'val1':(1,2),'val2':(3,5),'val3':(6,9)}
for name, r_range in val_range.items():
query_set = query_build(*r_range)
for query, val in query_set.items():
length = len(df.query(query))
if length:
df[name][df.query(query).index] = val
print(df)
CodePudding user response:
Another approach which is using numpy.select
(recommended in the Pandas User Guide here):
import numpy as np
for n, (start, stop) in enumerate([(1, 3), (3, 6), (6, 10)], start=1):
m = df.isin(range(start, stop))
condlist = [m.all(axis=1), m[["A", "B"]].all(axis=1), m["A"]]
df[f"Val_{n}"] = np.select(condlist, [3, 2, 1])
- For each iteration build a mask
m
ondf
that checks if thedf
values are in the resp.range(start, stop)
. - Based on
m
build a condition listcondlist
: The 1. entry checks if all values are in the range, the 2. checks if the values in columnsA
andB
are in the range, and the 3. checks if the value in columnA
is in the range - all checks are done row-wise. - Based on
condlist
set the corresponding values from[3, 2, 1]
(calledchoicelist
in thenumpy.select
docs) in the new column, and0
(standarddefault
) if no condition is met.
The selection follows the preferences in you code - see the numpy.select
documentation:
When multiple conditions are satisfied, the first one encountered in condlist is used.
Your question actually sounds a bit different to what you are doing:
I want to count the number of elements in a row that fall within a range and then print a new column with the results.
If that is your real goal, then you could try something simpler:
for n, (start, stop) in enumerate([(1, 3), (3, 6), (6, 10)], start=1):
df[f"Val_{n}"] = df.isin(range(start, stop)).sum(axis=1)
Why your approach fails (in addition to @Mazhar's explanation): This
if row.loc['A'] and row.loc['B'] and row.loc["C"] in range(1,3):
isn't how logical operators like and
work: You have to fully specify each part of the condition, like:
if (row.loc['A'] in range(1,3)) and (row.loc['B'] in range(1,3)) and (row.loc["C"] in range(1,3)):
The way you have used it actually resolves to
if bool(row.loc['A']) and bool(row.loc['B']) and row.loc["C"] in range(1,3):
which here means (since the values are numbers)
if (row.loc['A'] != 0) and (row.loc['B'] != 0) and (row.loc["C"] in range(1,3)):