ntiles over columns in python using R's "mutate(across(cols = ..."-CodePudding

In R I have some data:

library(tidyverse)

d = data.frame(
  "A" = c(0.1, 0.76, 0.54, 0.79, 0.28),
    "B" = c(3, 2, 4, 3, 4),
    "C" = c(2, 2, 7, 3, 4),
    "D" = c(4, 3, 6, 12, 7)
)

Where I can apply across each column a computation to compute the ntiles:

d %>% 
  mutate(across(.cols = everything(),
              .fns = ~ ntile(., 5),
              .names = "{.col}_ntiles")
)

Which gives me the following output - Desired output:

     A B C  D A_ntiles B_ntiles C_ntiles D_ntiles
1 0.10 3 2  4        1        2        1        2
2 0.76 2 2  3        4        1        2        1
3 0.54 4 7  6        3        4        5        3
4 0.79 3 3 12        5        3        3        5
5 0.28 4 4  7        2        5        4        4

I am trying to do this same task in Python.

I managed to get as far as the following:

import pandas as pd

d = pd.DataFrame({"A":[0.1, 0.76, 0.54, 0.79, 0.28],
                   "B":[3, 2, 4, 3, 4],
                   "C":[2, 2, 7, 3, 4],
                   "D":[4, 3, 6, 12, 7]})

d.quantile([0.2, 0.4, 0.6, 0.8], axis = 0)

Which gives me:

    A   B   C   D
0.2     0.244   2.8     2.0     3.8
0.4     0.436   3.0     2.6     5.2
0.6     0.628   3.4     3.4     6.4
0.8     0.766   4.0     4.6     8.0

However, I would like to obtain which rank category each ntile belongs to - just as in the R example using A_ntiles, B_ntiles ... D_ntiles.

CodePudding user response：

You can check with rank

out = d.join(d.rank(method='first').add_suffix('_ntile').astype(int))

out
Out[92]: 
      A  B  C   D  A_ntile  B_ntile  C_ntile  D_ntile
0  0.10  3  2   4        1        2        1        2
1  0.76  2  2   3        4        1        2        1
2  0.54  4  7   6        3        4        5        3
3  0.79  3  3  12        5        3        3        5
4  0.28  4  4   7        2        5        4        4

CodePudding user response：

Update

rank_quantile = lambda x: pd.qcut(x, q=5, labels=False, duplicates='drop')
out = pd.concat([d, d.apply(rank_quantile).add(1).add_suffix('_nqtiles')], axis=1)
print(out)

# Output
      A  B  C   D  A_nqtiles  B_nqtiles  C_nqtiles  D_nqtiles
0  0.10  3  2   4          1          2          1          2
1  0.76  2  2   3          4          1          1          1
2  0.54  4  7   6          3          4          4          3
3  0.79  3  3  12          5          2          2          5
4  0.28  4  4   7          2          4          3          4

Old answer

Use rank with method='first' then concatenate your dataframe:

out = pd.concat([d, d.rank(method='first').astype(int).add_suffix('_ntile')], axis=1)
print(out)

# Output
      A  B  C   D  A_ntile  B_ntile  C_ntile  D_ntile
0  0.10  3  2   4        1        2        1        2
1  0.76  2  2   3        4        1        2        1
2  0.54  4  7   6        3        4        5        3
3  0.79  3  3  12        5        3        3        5
4  0.28  4  4   7        2        5        4        4