I am trying to select multiple values within a range from all rows of each column and plot them all together.
The values in the dataframe are between 0 and 100. I want to select the values between 0 and 10 for all rows of one column, then repeat that for every 10-wide range up to 100 (e.g. values between 0 and 10: 2, 4, 6, 9, 1; then 10 to 20, 20 to 30, etc.) for each column.
After setting up the data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randint(0, 100, size = (100, 10)), columns=list('ABCDEFGHIJ'))
I am aware that I can select rows based on multiple column conditions with the following expression:
df_1 = df.A[(df.A > 0) & (df.A < 10)]
However, this way I can only select one range of values for one column at a time.
Is there a better way to do this for the full range of values (0 to 100 in steps of 10) and for all columns, rather than doing it manually for every range of every column? Done manually, I would need 10 conditions per column, and since there are 10 columns in the dataframe, that is 100 conditions, which I would like to avoid if possible.
I am also interested in whether there is a counting utility that can report how many rows fall in each 10-wide range for each column.
CodePudding user response:
Try:
df_1 = df[(df['A'] > 0) & (df['A'] < 10)]
You will get:
>>> df_1
A B C D E F G H I J
43 5 91 98 63 55 32 6 79 28 18
47 3 88 62 6 52 21 16 64 33 60
50 8 43 84 6 8 6 70 93 0 95
65 5 24 7 80 89 92 70 65 12 44
78 2 99 15 14 5 46 99 62 33 40
80 5 66 86 22 57 86 15 44 78 37
95 2 36 70 53 81 76 70 2 9 42
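To cover all ten ranges for all columns without writing 100 conditions, the same boolean filter can simply be put in a loop (a minimal sketch, reusing the df from the question; the `selections` dict name is my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)),
                  columns=list('ABCDEFGHIJ'))

# One filtered Series per (column, range) pair, e.g. selections[('A', 0)]
# holds the values of column A that fall in [0, 10)
selections = {}
for col in df.columns:
    for lo in range(0, 100, 10):
        selections[(col, lo)] = df[col][(df[col] >= lo) & (df[col] < lo + 10)]
```

Each entry keeps the original row index, so the pieces can still be plotted or counted individually; the count per range is just `len(selections[(col, lo)])`.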
CodePudding user response:
Try this: pd.cut bins your column, and you can give each bin any label you like. With the code below, every number from 0 to 10 becomes 1, from 10 to 20 becomes 2, and so on. Note that the 11 bin edges define 10 intervals, so 10 labels are needed:
for i in df.columns:
    bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    labels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    df[i] = pd.cut(df[i], bins=bins, labels=labels)
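pd.cut also answers the counting part of the question directly: binning each column and tallying with value_counts gives the number of rows per 10-wide range (a sketch, assuming half-open bins [0, 10), [10, 20), ... rather than the labels above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 10)),
                  columns=list('ABCDEFGHIJ'))

bins = range(0, 101, 10)
# right=False makes the bins half-open: [0, 10), [10, 20), ..., [90, 100)
counts = df.apply(
    lambda col: pd.cut(col, bins=bins, right=False).value_counts().sort_index()
)
print(counts)  # one row per bin, one column per original column
```

Each column of `counts` sums to the number of rows, since every value between 0 and 99 falls into exactly one bin.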
CodePudding user response:
Here is a small example which you can adapt for your purpose by changing rows to 100 and num_group to 10:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
# generate data
rows = 6
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 100, size = (rows, 10)), columns=list('ABCDEFGHIJ'))
print(df)
num_group = 3
print(df.index//num_group)
# Counter to keep track of the number of groups visited
i = 0
def count(x):
    global i
    num_rows = []
    for col in x.columns:
        num_rows.append(np.sum(x[col].between(i * 10, (i + 1) * 10, inclusive="left")))
    i += 1
    return pd.Series(dict(zip(list('ABCDEFGHIJ'), num_rows)))

print(df.groupby(df.index // num_group).apply(lambda x: count(x)))
This gives:
# df
A B C D E F G H I J
0 47 51 75 95 3 14 82 94 24 31
1 86 42 27 82 25 40 64 54 8 2
2 86 75 83 53 81 32 45 78 12 30
3 12 45 97 13 38 40 90 20 50 26
4 1 75 6 28 49 48 11 98 74 96
5 9 72 29 54 92 27 72 16 32 96
# df.index // num_group
Int64Index([0, 0, 0, 1, 1, 1], dtype='int64')
and the expected output, counting the number of rows whose value in a particular column lies in the range [0, 10) for the first group and [10, 20) for the second group, respectively:
A B C D E F G H I J
0 0 0 0 0 1 0 0 0 1 1
1 1 0 0 1 0 0 1 1 0 0
To understand how grouping works on rows, please see this answer or pandas.groupby. To check whether a value at a particular index of a pandas Series lies within a particular range, we have used pandas.Series.between.
Most importantly, we have created a function count that takes the grouped data frame and, for each column, gives us the count of rows that lie within a particular range. Finally, as num_rows is just a plain list, we convert it to a pandas.Series before returning the output to pandas.apply.
So, the answer will be:
df.groupby(df.index // num_group).apply(
    lambda g:
        pd.Series(dict(zip(
            list('ABCDEFGHIJ'),
            g.apply(lambda col:
                np.sum(col.between((col.index // num_group) * 10,
                                   ((col.index // num_group) + 1) * 10)))))))
Please be aware that the number of groups equals the number of unique integers in df.index // num_group, not num_group.
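For reference, the same per-group counts can be computed without the global counter by reading the group label from the group's .name attribute inside groupby.apply (a sketch with a hypothetical count_group helper; the label k selects the [k*10, (k+1)*10) range for group k):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 100, size=(6, 10)),
                  columns=list('ABCDEFGHIJ'))
num_group = 3

def count_group(g):
    k = g.name  # group label assigned by groupby, i.e. index // num_group
    # count, per column, the rows whose value lies in [k*10, (k+1)*10)
    return g.apply(lambda col: col.between(k * 10, (k + 1) * 10,
                                           inclusive="left").sum())

out = df.groupby(df.index // num_group).apply(count_group)
print(out)
```

Because the group label replaces the mutable counter, the function is safe to re-run and does not depend on the order in which groups are visited.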