My problem concerns lookup()
, which is to be depreciated. So I'm looking for an alternative. Documentation suggests using loc()
(which does not seem to work with a vectorized approach) or melt()
(which seems quite convoluted). Furthermore, the documentation suggests factorize()
which (I think) does not work for my setup.
Here is the problem: I have a 2-column DataFrame with x,y-values.
k = 20
y = random.choices(range(1,4),k=k)
x = random.choices(range(1,7),k=k)
tuples = list(zip(x,y))
df = pd.DataFrame(tuples, columns=["x", "y"])
df
And I have several DataFrames in crosstab-format of df
. For example one called Cij
:
Concordance table (Cij):
x 1 2 3 4 5 6 RTotal
y
1 16 15 13 NaN 5 NaN 108
2 NaN 12 NaN 15 NaN NaN 87
3 NaN NaN 6 NaN 13 14 121
I now want to perform a vectorized lookup in Cij
from xy-pairs in df
to generate a new column CrC
in df
. Which so far looked like this (plain and simple):
df["Crc"] = Cij.lookup(df["y"],df["x"])
How can I achieve the same thing without lookup()
? Or did I just not understand the suggested alternatives?
Thanks in advance!
Addendum: Working code example as requested.
data = [[1,1],[1,1],[1,2],[1,2],[1,2],[1,3],[1,3],[1,5],[2,2],[2,4],[2,4],[2,4],[2,4],[2,4],[3,3],[3,3],[3,5],[3,5],[3,5],[3,6],[3,6],[3,6],[3,6],[3,6]]
df = pd.DataFrame(data, columns=["y", "x"])
# crosstab of df
ct_a = pd.crosstab(df["y"], df["x"])
Cij = pd.DataFrame([], index=ct_a.index, columns=ct_a.columns) #one of several dfs in ct_a layout
#row-wise, than column-wise filling of Cij
for i in range(ct_a.shape[0]):
for j in range(ct_a.shape[1]):
if ct_a.iloc[i,j] != 0:
Cij.iloc[i,j]= ct_a.iloc[i 1:,j 1:].sum().sum() ct_a.iloc[:i,:j].sum().sum()
#vectorized lookup, to be substituted with future-proof method
df["Crc"] = Cij.lookup(df["y"],df["x"])
Note: In this case loop-based "filling" of Cij
is fine, since crosstabs of df
are always small. However, df
itself can be very large so vectorized lookup is a necessity.
CodePudding user response:
If you have checked df["Crc"] = Cij.loc[df["y"], df["x"]]
, you will notice that it returns an array. By comparing this with df["Crc"] = Cij.lookup(df["y"],df["x"])
, you will also notice that the leading diagonal is the same (which makes sense). Therefore, you can add np.diagonal
to return what you need:
df["Crc"] = np.diagonal(Cij.loc[df["y"], df["x"]])
CodePudding user response:
IIUC, you can stack
Cij and then reindex
based on a list of tuples created by using zip
:
df['Crc'] = Cij.stack().reindex(zip(df['y'], df['x'])).to_numpy()
print(df)
Output:
y x Crc
0 1 1 16.0
1 1 1 16.0
2 1 2 15.0
3 1 2 15.0
4 1 2 15.0
5 1 3 13.0
6 1 3 13.0
7 1 5 5.0
8 2 2 12
9 2 4 15
10 2 4 15
11 2 4 15
12 2 4 15
13 2 4 15
14 3 3 6.0
15 3 3 6.0
16 3 5 13.0
17 3 5 13.0
18 3 5 13.0
19 3 6 14.0
20 3 6 14.0
21 3 6 14.0
22 3 6 14.0
23 3 6 14.0