Home > OS >  Select rows between multiple values for all columns in a dataframe
Select rows between multiple values for all columns in a dataframe

Time:11-18

I am trying to select multiple values between a range fom all rows per each column and plot them all-together.

The values in the dataframe are between 0 and 100. I want to select a range of values between 0 to 10 for all rows of one column, and then repeat that iteration every 10 values until 100 (e.g.: values between 0 to 10: 2, 4, 6, 9, 1 and then 10 to 20, 20 to 30, etc.) for each column.

After importing data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


df = pd.DataFrame(np.random.randint(0, 100, size = (100, 10)), columns=list('ABCDEFGHIJ'))

I am aware that I can select rows based on multiple column conditions with the following function

df_1 = df.A[(df.A > 0) & (df.A < 10)]

However, by doing it this way I can only select one range of values per one column at a time.

Is there a better way how to do this for a full range of values (0 to 100 every 10 iterations) and for all columns, rather than doing it manually for every range for all columns? If done manually, I need to set up 10 conditions per column and since there are 10 columns in the dataframe, it would end up with 100 conditions, which I wish to omit if possible.

I am also interested if there is a counter library that can do this kind of operation, just to provide an output of how many rows are between 0 to 10 every 10 iterations for each column.

CodePudding user response:

Try :

df_1 = df[(df['A']>0) & (df['A']<10)]

You will get :

>>> df_1

    A   B   C   D   E   F   G   H   I   J
43  5   91  98  63  55  32  6   79  28  18
47  3   88  62  6   52  21  16  64  33  60
50  8   43  84  6   8   6   70  93  0   95
65  5   24  7   80  89  92  70  65  12  44
78  2   99  15  14  5   46  99  62  33  40
80  5   66  86  22  57  86  15  44  78  37
95  2   36  70  53  81  76  70  2   9   42

CodePudding user response:

Try this; pd.cut ==> cut your column to bins and you can give any label name between this numbers. For example with this code , every number 0 to 10 will be 1, 11 to 20 will be 2... etc

for i in df.columns:
    bins = [0,10,20,30,40,50,60,70,80,90,100]
    labels = [1,2,3,4,5,6,7,8,9]    
    df[i] = pd.cut(df[i], bins=bins, labels=labels)

CodePudding user response:

Here is a small example which you can use for your purpose by modifying the values of rows to 100 and num_group to 10:

import pandas as pd
import numpy as np 
pd.set_option('display.max_rows', 500)

# generate data
rows = 6
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 100, size = (rows, 10)), columns=list('ABCDEFGHIJ'))

print(df)

num_group = 3
print(df.index//num_group)

# Counter to track of the number of groups visited
i = 0
def count(x):
    global i
    num_rows = []
    for col in x.columns:
        num_rows.append(np.sum(x[col].between(i*10, (i 1)*10, inclusive="left")))
    i  = 1
    return pd.Series(dict(zip(list('ABCDEFGHIJ'), num_rows)))

print(df.groupby(df.index // num_group).apply(lambda x: count(x)))

This gives:

# df
    A   B   C   D   E   F   G   H   I   J
0  47  51  75  95   3  14  82  94  24  31
1  86  42  27  82  25  40  64  54   8   2
2  86  75  83  53  81  32  45  78  12  30
3  12  45  97  13  38  40  90  20  50  26
4   1  75   6  28  49  48  11  98  74  96
5   9  72  29  54  92  27  72  16  32  96
# df.index // num_group
Int64Index([0, 0, 0, 1, 1, 1], dtype='int64')

and the expected output for counting the number of rows whose value in a particular column lie in the range of [0, 10) and [10, 20) for the first and second groups, respectively:

   A  B  C  D  E  F  G  H  I  J
0  0  0  0  0  1  0  0  0  1  1
1  1  0  0  1  0  0  1  1  0  0

To understand how grouping works on rows, please see this answer or pandas.groupby. To check if a value at a particular index of pandas Series lies within a particular range, we have used pandas.Series.between.

Most importantly, we have created a function count that takes the grouped data frame, and for each column, gives us the count of rows that lie within a particular range. Finally, as num_rows is just a simple list, we convert it to pandas.Series before returning the output to pandas.apply.

So, the answer will be:

df.groupby(df.index // num_group).apply(
    lambda x: 
    pd.Series(dict(zip(
        list('ABCDEFGHIJ'), 
        x.apply(lambda x: 
            np.sum(x.between((x.index//num_group) * 10, ((x.index//num_group)  1) * 10)))))))

Please be aware that the number of groups is equal to the unique integers in df.index // num_group and not num_group.

  • Related