I have two dataframes:
Frame A:
OB_ID CA_ID col1 col2 col3
4 4 a b c
4 4 a d b
3 5 c c e
this dataframe is very large and I am not sure about all the IDs in it.
Frame B:
OB_ID CA_ID colZ
1 1 sky
4 4 fire
4 3 data
I want to add colZ to frame A only where OB_ID and CA_ID match, otherwise put NaN/null in that place, so the resulting dataframe looks like this:
OB_ID CA_ID col1 col2 col3 colz
4 4 a b c fire
4 4 a d b fire
3 5 c c e NA/unknown
The shape of Frame A is 8666515 rows × 3 columns, and for B is 367469 rows × 342 columns, but when I do
df3 = pd.merge(frameA, frameB, on=['OB_ID','CA_ID'], how='left')
df3.shape
it gives shape: 1490420 rows × 343 columns
, but I don't understand why the number of rows grows from 367469 to 1490420.
CodePudding user response:
The resulting number of rows after a left merge can be greater than that of the left dataframe because of nonunique entries in the `on` part, i.e., the `["OB_ID", "CA_ID"]` columns, and how pandas tries to align things. For example, frameA has the pair `[4, 4]` two times. If frameB, say, has it 3 times, then the alignment procedure dictates that the merged frame will have 2 x 3 = 6 `[4, 4]` rows, i.e., the cross product is taken. (Note: I hope the example `[4, 4]` isn't confusing: it's not that 4 is repeated twice there; it's that `[4, 4]` as a pair/row is repeated twice.)
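This cross-product behaviour can be seen with a tiny made-up example (the values here are illustrative, not taken from the question's data):

```python
import pandas as pd

# The pair (4, 4) appears twice on the left and three times on the right.
left = pd.DataFrame({"OB_ID": [4, 4], "CA_ID": [4, 4], "col1": ["a", "b"]})
right = pd.DataFrame({"OB_ID": [4, 4, 4], "CA_ID": [4, 4, 4],
                      "colZ": ["x", "y", "z"]})

# Left merge pairs every matching left row with every matching right row.
merged = pd.merge(left, right, on=["OB_ID", "CA_ID"], how="left")
print(len(merged))  # 2 x 3 = 6 rows
```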
This is a theme that resonates over pandas:
>>> pd.Series([12, 3], index=[0, 0]) + pd.Series([40, 50, 60], index=[0, 0, 0])
What would you say about the index & size of the result of this summation? Yes: its size is 2 x 3 = 6, and the index is six zeros.
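Running that summation confirms the alignment (a small sketch; the values are the pairwise sums of the two series):

```python
import pandas as pd

s1 = pd.Series([12, 3], index=[0, 0])
s2 = pd.Series([40, 50, 60], index=[0, 0, 0])

# Alignment on a nonunique index takes the cross product of matching labels.
result = s1 + s2
print(result.size)            # 6
print(result.index.tolist())  # [0, 0, 0, 0, 0, 0]
print(result.tolist())
```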
Back to the question... Instead of merging, we can `map` frameA's related column values, right? And the `mapper` would be frameB modified slightly: a Series is kind of a dictionary (mapping) whose index holds the "keys" and whose values are, well, the corresponding values. So we make `["OB_ID", "CA_ID"]` the index and `"colZ"` the values; that gives the `mapper`. The thing is, we cannot directly map `df[["OB_ID", "CA_ID"]]`... because a DataFrame isn't directly mappable. But a MultiIndex is! So here we go:
common_cols = ["OB_ID", "CA_ID"]
target_col = "colZ"
# Series mapping (OB_ID, CA_ID) pairs -> colZ values
mapper = frame_B.set_index(common_cols)[target_col]
# Build a MultiIndex from frame_A's key columns and look each pair up
frame_A[target_col] = pd.MultiIndex.from_frame(frame_A[common_cols]).map(mapper)
to get
>>> frame_A
OB_ID CA_ID col1 col2 col3 colZ
0 4 4 a b c fire
1 4 4 a d b fire
2 3 5 c c e NaN
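For completeness, here is a self-contained run of the same idea, recreating the question's frames (note this assumes the `["OB_ID", "CA_ID"]` pairs are unique in frame_B; with duplicates the mapping would be ambiguous):

```python
import pandas as pd

frame_A = pd.DataFrame({
    "OB_ID": [4, 4, 3], "CA_ID": [4, 4, 5],
    "col1": ["a", "a", "c"], "col2": ["b", "d", "c"], "col3": ["c", "b", "e"],
})
frame_B = pd.DataFrame({
    "OB_ID": [1, 4, 4], "CA_ID": [1, 4, 3], "colZ": ["sky", "fire", "data"],
})

common_cols = ["OB_ID", "CA_ID"]
target_col = "colZ"

# Series mapping (OB_ID, CA_ID) pairs -> colZ values
mapper = frame_B.set_index(common_cols)[target_col]
# Look up each key pair of frame_A via a MultiIndex; misses become NaN
frame_A[target_col] = pd.MultiIndex.from_frame(frame_A[common_cols]).map(mapper)
print(frame_A)
```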
CodePudding user response:
Example dataframes:
import pandas as pd

frame_A = pd.DataFrame({
    'OB_ID': [4, 4, 3],
    'CA_ID': [4, 4, 5],
    'col1': ['a', 'a', 'c'],
    'col2': ['b', 'd', 'c'],
    'col3': ['c', 'b', 'e']
})
frame_B = pd.DataFrame({
    'OB_ID': [3, 3, 4],
    'CA_ID': [5, 5, 6],
    'colZ': ['sky', 'air', 'stone']
})
common_cols = ['OB_ID','CA_ID']
Current behaviour (just visualising it for better understanding):
Note that (as @MustafaAydın explained) `len(frame_A) == 3`. But since the pair `[3, 5]` of `['OB_ID', 'CA_ID']` repeats twice in `frame_B`, it is used twice when merging, and the length of the result becomes 4:
pd.merge(frame_A, frame_B, on=common_cols, how='left')
   OB_ID  CA_ID col1 col2 col3 colZ
0      4      4    a    b    c  NaN
1      4      4    a    d    b  NaN
2      3      5    c    c    e  sky
3      3      5    c    c    e  air
Possible Solution
Is this what you want?:
Using `.drop_duplicates`, you can keep only the first (or last) occurrence of the rows in `frame_B` whose `common_cols` values are the same:
frame_B_dropped = frame_B.drop_duplicates(common_cols)
pd.merge(frame_A, frame_B_dropped, on=common_cols, how='left')
   OB_ID  CA_ID col1 col2 col3 colZ
0      4      4    a    b    c  NaN
1      4      4    a    d    b  NaN
2      3      5    c    c    e  sky
Here the length of the result should always equal `len(frame_A)`.
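As an aside (not part of the original answer): pandas' `merge` also has a `validate` argument that can catch this situation up front, raising an error instead of silently multiplying rows. A small sketch using the same example frames:

```python
import pandas as pd

frame_A = pd.DataFrame({"OB_ID": [4, 4, 3], "CA_ID": [4, 4, 5]})
frame_B = pd.DataFrame({"OB_ID": [3, 3, 4], "CA_ID": [5, 5, 6],
                        "colZ": ["sky", "air", "stone"]})
common_cols = ["OB_ID", "CA_ID"]

# validate="many_to_one" raises MergeError if the right-hand keys repeat,
# so dropping duplicates first makes the merge safe.
deduped = frame_B.drop_duplicates(common_cols)
merged = pd.merge(frame_A, deduped, on=common_cols, how="left",
                  validate="many_to_one")
print(len(merged))  # 3, same as len(frame_A)
```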