Get row count in DataFrame without for loop

Time:09-29

I need to check whether the last value of dataframe['position'] is different from 0 and, if it is, count backwards from the end how many consecutive values are equal to it, stopping at the first value that differs. The count should be stored in a variable, and I want to do this without a for loop, using loc or iloc for example.

dataframe:
  | position |
0          1
1          0  
2          1  <4 
3          1  <3
4          1  <2
5          1  <1

count = 4  

I achieved this by a for loop, but I need to avoid it:

count = 1
if data['position'].iloc[-1] != 0:
   for i in data['position']:
      if data['position'].iloc[-count] == data['position'].iloc[-1]:
         count = count + 1
      else:
         break
   if data['position'].iloc[-count] != data['position'].iloc[-1]:
      count = count - 1 

CodePudding user response:

This should do the trick:

((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])).cumprod().sum()

This builds a condition, (data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1]), that flags each row (counting backwards from the end) where the value equals the last value and the last value is nonzero. The flags are then coerced to 0 or 1 and the cumulative product is taken, so the first non-matching row produces a 0 that zeroes out every subsequent value. Summing the result gives the number of consecutive matching values at the end.
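To make the intermediate steps visible, here is a minimal sketch that splits the one-liner into named pieces and runs it on the data from the question. The variable names (last, matches, flags, run) are only illustrative, and the explicit astype(int) is added here to show the 0/1 coercion; it is not part of the one-liner.

```python
import pandas as pd

# Data from the question: the last four values are equal and non-zero.
data = pd.DataFrame({"position": [1, 0, 1, 1, 1, 1]})

last = data.iloc[-1]               # last row as a Series, indexed by column name
matches = data[::-1] == last       # rows in reverse order, compared with the last row
flags = (last != 0) & matches      # False everywhere in a column whose last value is 0
run = flags.astype(int).cumprod()  # stays 1 until the first mismatch, then 0
count = run.sum()                  # one count per column

print(run["position"].tolist())    # [1, 1, 1, 1, 0, 0]
print(int(count["position"]))      # 4
```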

Depending on your data, though, stepping iteratively backwards from the end may be faster. This solution is vectorized, but it processes the entire column and builds several intermediate results that are the same size as the original series.
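For comparison, a rough sketch of what stepping backwards with an early exit could look like. It does use a loop, which the question wanted to avoid, but it stops at the first mismatch instead of touching the whole column. The helper name trailing_run_length is hypothetical, and returning 0 for an empty input is an assumption of this sketch.

```python
import numpy as np
import pandas as pd


def trailing_run_length(values):
    """Count trailing values equal to the last one; 0 if the last value is 0 or the input is empty."""
    arr = np.asarray(values)
    if arr.size == 0 or arr[-1] == 0:
        return 0
    last = arr[-1]
    count = 0
    # Walk backwards and stop at the first value that differs from the last one.
    for value in arr[::-1]:
        if value != last:
            break
        count += 1
    return count


data = pd.DataFrame({"position": [1, 0, 1, 1, 1, 1]})
print(trailing_run_length(data["position"].to_numpy()))  # 4
```

Whether this beats the vectorized version depends on how long the trailing run is relative to the column, so it is worth timing on your own data.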

Example:

In [12]: data = pd.DataFrame(np.random.randint(0, 3, size=(10, 5)), columns=list('ABCDE'))
    ...: data
Out[12]:
   A  B  C  D  E
0  2  0  1  2  0
1  1  0  1  2  1
2  2  1  2  1  0
3  1  0  1  2  2
4  1  1  0  0  2
5  2  2  1  0  2
6  2  1  1  2  2
7  0  0  0  1  0
8  2  2  0  0  1
9  2  0  0  2  1

In [13]: ((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])).cumprod().sum()
Out[13]:
A    2
B    0
C    0
D    1
E    2
dtype: int64

CodePudding user response:

You can reverse your Series, convert it to boolean using the target condition (here "not equal to 0", with ne), apply cummin to propagate the first False to every later position of the reversed Series, and sum to count the trailing True values:

count = df.loc[::-1, 'position'].ne(0).cummin().sum()

Output: 4
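A small sketch of the same chain broken into steps, using the data from the question (the intermediate variable names are only for illustration). Note that this counts trailing non-zero values, so it agrees with the question's definition when the column only holds 0s and 1s; the second example shows a case with mixed non-zero values where the two definitions differ.

```python
import pandas as pd

df = pd.DataFrame({"position": [1, 0, 1, 1, 1, 1]})

reversed_col = df.loc[::-1, "position"]  # values 1, 1, 1, 1, 0, 1 (index 5 down to 0)
nonzero = reversed_col.ne(0)             # True, True, True, True, False, True
still_trailing = nonzero.cummin()        # once a False appears, everything after it is False
count = int(still_trailing.sum())

print(still_trailing.tolist())  # [True, True, True, True, False, False]
print(count)                    # 4

# Mixed non-zero values: three trailing non-zeros, but only two equal to the last value.
mixed = pd.Series([0, 2, 1, 1])
mixed_count = int(mixed[::-1].ne(0).cummin().sum())
print(mixed_count)              # 3
```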

If you have multiple columns:

counts = df.loc[::-1].ne(0).cummin().sum()

Alternative

A slightly faster alternative (~25% faster) is to find the last zero and use indexing. It relies on the assumptions that the column contains at least one zero and that the index has no duplicates:

m = df['position'].eq(0)
count = len(df.loc[m[m].index[-1]:])-1

Without the requirement to have at least one zero:

m = df['position'].eq(0)
m = m[m]

count = len(df) if m.empty else len(df.loc[m.index[-1]:])-1
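As a quick check, here is the last variant wrapped in a function and run on a column with a zero, a column without any zero, and a column ending in zero. The function name count_after_last_zero and its column parameter are made up for this sketch; it still assumes unique index labels, since it slices by label with loc.

```python
import pandas as pd


def count_after_last_zero(df, column="position"):
    """Number of rows after the last zero in `column`; all rows if there is no zero."""
    m = df[column].eq(0)
    m = m[m]  # keep only the rows where the value is zero
    if m.empty:
        return len(df)
    # Label-based slice from the last zero to the end, minus the zero row itself.
    return len(df.loc[m.index[-1]:]) - 1


with_zero = pd.DataFrame({"position": [1, 0, 1, 1, 1, 1]})
no_zero = pd.DataFrame({"position": [1, 1, 1]})
ends_in_zero = pd.DataFrame({"position": [1, 1, 0]})

print(count_after_last_zero(with_zero))     # 4
print(count_after_last_zero(no_zero))       # 3
print(count_after_last_zero(ends_in_zero))  # 0
```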