I have two dataframes:
Frame A:
OB_ID CA_ID col1 col2 col3
4 4 a b c
4 4 a d b
3 5 c c e
this dataframe is very large and I am not sure about all the IDs in it.
Frame B:
OB_ID CA_ID colZ
1 1 sky
4 4 fire
4 3 data
I want to add colZ to frame A only where OB_ID and CA_ID match, otherwise put NaN/null in that place, so the resulting dataframe looks like this:
OB_ID CA_ID col1 col2 col3 colz
4 4 a b c fire
4 4 a d b fire
3 5 c c e NA/unknown
The shape of Frame A is 8666515 rows × 3 columns, and for B is 367469 rows × 342 columns, but when I do
df3 = pd.merge(frameA, frameB, on=['OB_ID','CA_ID'], how='left')
df3.shape
it gives shape: 1490420 rows × 343 columns
, but I don't understand why the number of rows grows from 367469 to 1490420.
CodePudding user response:
The resulting number of rows after a left merge can be greater than that of the left dataframe because of nonunique entries in the `on` part, i.e., the `["OB_ID", "CA_ID"]` columns, and how pandas tries to align things. For example, frameA has the pair `[4, 4]` two times. If frameB, say, has it 3 times, then the alignment procedure dictates that the merged frame will have 2 x 3 = 6 `[4, 4]` rows, i.e., the cross product is taken. (Note: I hope the example `[4, 4]` isn't confusing: it's not that 4 is repeated twice there; it's that `[4, 4]` as a pair/row is repeated twice.)
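This cross-product behaviour can be seen with a tiny made-up example (the values here are illustrative, not taken from the question's data):

```python
import pandas as pd

# The pair (4, 4) appears twice on the left and three times on the right.
left = pd.DataFrame({"OB_ID": [4, 4], "CA_ID": [4, 4], "col1": ["a", "b"]})
right = pd.DataFrame({"OB_ID": [4, 4, 4], "CA_ID": [4, 4, 4],
                      "colZ": ["x", "y", "z"]})

# Left merge pairs every matching left row with every matching right row.
merged = pd.merge(left, right, on=["OB_ID", "CA_ID"], how="left")
print(len(merged))  # 2 x 3 = 6 rows
```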
This is a theme that resonates over pandas:
>>> pd.Series([12, 3], index=[0, 0]) + pd.Series([40, 50, 60], index=[0, 0, 0])
What would you say about the index & size of the result of this summation? Yes: its size is 2 x 3 = 6, and the index is six zeros.
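Running that summation confirms the alignment (a small sketch; the values are the pairwise sums of the two series):

```python
import pandas as pd

s1 = pd.Series([12, 3], index=[0, 0])
s2 = pd.Series([40, 50, 60], index=[0, 0, 0])

# Alignment on a nonunique index takes the cross product of matching labels.
result = s1 + s2
print(result.size)            # 6
print(result.index.tolist())  # [0, 0, 0, 0, 0, 0]
print(result.tolist())
```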
Back to the question... Instead of merging, we can `map` frameA's related column values, right? And the `mapper` would be frameB modified slightly: a Series is kind of a dictionary (mapping) whose index holds the "keys" and whose values are, well, the corresponding values. So we make `["OB_ID", "CA_ID"]` the index and `"colZ"` the values; that gives the `mapper`. The thing is, we cannot directly map `df[["OB_ID", "CA_ID"]]`... because a DataFrame isn't directly mappable. But a MultiIndex is! So here we go:
common_cols = ["OB_ID", "CA_ID"]
target_col = "colZ"
# Series mapping (OB_ID, CA_ID) pairs -> colZ values
mapper = frame_B.set_index(common_cols)[target_col]
# Build a MultiIndex from frame_A's key columns and look each pair up
frame_A[target_col] = pd.MultiIndex.from_frame(frame_A[common_cols]).map(mapper)
to get
>>> frame_A
OB_ID CA_ID col1 col2 col3 colZ
0 4 4 a b c fire
1 4 4 a d b fire
2 3 5 c c e NaN
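For completeness, here is a self-contained run of the same idea, recreating the question's frames (note this assumes the `["OB_ID", "CA_ID"]` pairs are unique in frame_B; with duplicates the mapping would be ambiguous):

```python
import pandas as pd

frame_A = pd.DataFrame({
    "OB_ID": [4, 4, 3], "CA_ID": [4, 4, 5],
    "col1": ["a", "a", "c"], "col2": ["b", "d", "c"], "col3": ["c", "b", "e"],
})
frame_B = pd.DataFrame({
    "OB_ID": [1, 4, 4], "CA_ID": [1, 4, 3], "colZ": ["sky", "fire", "data"],
})

common_cols = ["OB_ID", "CA_ID"]
target_col = "colZ"

# Series mapping (OB_ID, CA_ID) pairs -> colZ values
mapper = frame_B.set_index(common_cols)[target_col]
# Look up each key pair of frame_A via a MultiIndex; misses become NaN
frame_A[target_col] = pd.MultiIndex.from_frame(frame_A[common_cols]).map(mapper)
print(frame_A)
```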
CodePudding user response:
Example dataframes:
import pandas as pd

frame_A = pd.DataFrame({
    'OB_ID': [4, 4, 3],
    'CA_ID': [4, 4, 5],
    'col1': ['a', 'a', 'c'],
    'col2': ['b', 'd', 'c'],
    'col3': ['c', 'b', 'e']
})
frame_B = pd.DataFrame({
    'OB_ID': [3, 3, 4],
    'CA_ID': [5, 5, 6],
    'colZ': ['sky', 'air', 'stone']
})
common_cols = ['OB_ID','CA_ID']
Current behaviour (just visualising it for better understanding):
Note that (as @MustafaAydın explained) `len(frame_A) == 3`. But since the pair `[3, 5]` of `['OB_ID', 'CA_ID']` repeats twice in `frame_B`, it is used twice when merging, and the length of the result becomes 4:
pd.merge(frame_A, frame_B, on=common_cols, how='left')
   OB_ID  CA_ID col1 col2 col3 colZ
0      4      4    a    b    c  NaN
1      4      4    a    d    b  NaN
2      3      5    c    c    e  sky
3      3      5    c    c    e  air
Possible Solution
Is this what you want?:
Using `.drop_duplicates`, you can keep only the first (or last) occurrence of the rows in `frame_B` whose `common_cols` values are the same:
frame_B_dropped = frame_B.drop_duplicates(common_cols)
pd.merge(frame_A, frame_B_dropped, on=common_cols, how='left')
   OB_ID  CA_ID col1 col2 col3 colZ
0      4      4    a    b    c  NaN
1      4      4    a    d    b  NaN
2      3      5    c    c    e  sky
Here the length of the result should always equal `len(frame_A)`.
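As an aside (not part of the original answer): pandas' `merge` also has a `validate` argument that can catch this situation up front, raising an error instead of silently multiplying rows. A small sketch using the same example frames:

```python
import pandas as pd

frame_A = pd.DataFrame({"OB_ID": [4, 4, 3], "CA_ID": [4, 4, 5]})
frame_B = pd.DataFrame({"OB_ID": [3, 3, 4], "CA_ID": [5, 5, 6],
                        "colZ": ["sky", "air", "stone"]})
common_cols = ["OB_ID", "CA_ID"]

# validate="many_to_one" raises MergeError if the right-hand keys repeat,
# so dropping duplicates first makes the merge safe.
deduped = frame_B.drop_duplicates(common_cols)
merged = pd.merge(frame_A, deduped, on=common_cols, how="left",
                  validate="many_to_one")
print(len(merged))  # 3, same as len(frame_A)
```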