Home > Blockchain >  pandas vectorized lookup without depreciated lookup()
pandas vectorized lookup without depreciated lookup()

Time:06-27

My problem concerns lookup(), which is to be depreciated. So I'm looking for an alternative. Documentation suggests using loc() (which does not seem to work with a vectorized approach) or melt() (which seems quite convoluted). Furthermore, the documentation suggests factorize() which (I think) does not work for my setup.

Here is the problem: I have a 2-column DataFrame with x,y-values.

k = 20
y = random.choices(range(1,4),k=k)
x = random.choices(range(1,7),k=k)
tuples = list(zip(x,y))
df = pd.DataFrame(tuples, columns=["x", "y"])
df

And I have several DataFrames in crosstab-format of df. For example one called Cij:

Concordance table (Cij):
x     1     2     3    4     5     6  RTotal
y                                           
1   16     15    13  NaN     5   NaN     108
2   NaN    12   NaN   15   NaN   NaN      87
3   NaN   NaN     6  NaN    13    14     121

I now want to perform a vectorized lookup in Cij from xy-pairs in df to generate a new column CrC in df. Which so far looked like this (plain and simple):

df["Crc"] = Cij.lookup(df["y"],df["x"])

How can I achieve the same thing without lookup()? Or did I just not understand the suggested alternatives?

Thanks in advance!

Addendum: Working code example as requested.

data = [[1,1],[1,1],[1,2],[1,2],[1,2],[1,3],[1,3],[1,5],[2,2],[2,4],[2,4],[2,4],[2,4],[2,4],[3,3],[3,3],[3,5],[3,5],[3,5],[3,6],[3,6],[3,6],[3,6],[3,6]]
df = pd.DataFrame(data, columns=["y", "x"])

# crosstab of df
ct_a = pd.crosstab(df["y"], df["x"])
Cij = pd.DataFrame([], index=ct_a.index, columns=ct_a.columns) #one of several dfs in ct_a layout

#row-wise, than column-wise filling of Cij
for i in range(ct_a.shape[0]):           
  for j in range(ct_a.shape[1]):
    if ct_a.iloc[i,j] != 0:
      Cij.iloc[i,j]= ct_a.iloc[i 1:,j 1:].sum().sum() ct_a.iloc[:i,:j].sum().sum()

#vectorized lookup, to be substituted with future-proof method
df["Crc"] = Cij.lookup(df["y"],df["x"])

Note: In this case loop-based "filling" of Cij is fine, since crosstabs of df are always small. However, df itself can be very large so vectorized lookup is a necessity.

CodePudding user response:

If you have checked df["Crc"] = Cij.loc[df["y"], df["x"]], you will notice that it returns an array. By comparing this with df["Crc"] = Cij.lookup(df["y"],df["x"]), you will also notice that the leading diagonal is the same (which makes sense). Therefore, you can add np.diagonal to return what you need:

df["Crc"] = np.diagonal(Cij.loc[df["y"], df["x"]])

CodePudding user response:

IIUC, you can stack Cij and then reindex based on a list of tuples created by using zip:

df['Crc'] = Cij.stack().reindex(zip(df['y'], df['x'])).to_numpy()
print(df)

Output:

    y  x   Crc
0   1  1  16.0
1   1  1  16.0
2   1  2  15.0
3   1  2  15.0
4   1  2  15.0
5   1  3  13.0
6   1  3  13.0
7   1  5   5.0
8   2  2    12
9   2  4    15
10  2  4    15
11  2  4    15
12  2  4    15
13  2  4    15
14  3  3   6.0
15  3  3   6.0
16  3  5  13.0
17  3  5  13.0
18  3  5  13.0
19  3  6  14.0
20  3  6  14.0
21  3  6  14.0
22  3  6  14.0
23  3  6  14.0
  • Related