Home > Net >  How to merge two datasets on incomplete columns?
How to merge two datasets on incomplete columns?

Time:11-10

I want to merge two datasets on 'key1' and 'key2' columns so that in case of missing value, for example, in the 'key2' column, it would take all combinations of the second key that belong to the first key. Here is an example:

    def merge_nan_as_any(mask, data, on, how)
        ...


    mask = pd.DataFrame({'key1': [1,1,2,2],
                         'key2': [None,3,1,2],
                         'value2': [1,2,3,4]})
    data = pd.DataFrame({'key1': [1,1,1,2,2,2],
                         'key2': [1,2,3,1,2,3],
                         'value1': [1,2,3,4,5,6]})

    result = merge_nan_as_any(mask, data, on=['key1', 'key2'], how='left')

    result = pd.DataFrame({'key1': [1,1,1,1,2,2],
                           'key2': [1,2,3,3,1,2],
                           'value2': [1,1,1,2,3,4],
                           'value1': [1,2,3,3,4,5]})

There is a missed value of the second key, so it takes all rows from the second dataset that satisfy the condition: key1 must equal to 1, key2 is any the second key value from the second dataset. How to do that?

The first obvious solution that came to my mind is to iterate over the first dataset and filter out combinations that satisfy the condition and the second one is to split the first dataset into several ones so that they have NaNs in the same columns and merge each of them on columns that have values.

But I don't like these solutions and guess there is more elegant way to do what I want.

I will appreciate for any help!

CodePudding user response:

Simple approach, merge on key1/key2 for the non-NaN values, merge on key1 only for the NaN values and concat:

m = mask['key2'].notna()

result = pd.concat([data.merge(mask[~m].drop(columns='key2'), on='key1'),
                    data.merge(mask[m], on=['key1', 'key2']),
                   ], ignore_index=True)

Output:

   key1  key2  value1  value2
0     1     1       1       1
1     1     2       2       1
2     1     3       3       1
3     1     3       3       2
4     2     1       4       3
5     2     2       5       4

CodePudding user response:

I would begin by filling the null values with a list of all unique values from the other dataframe. Then, explode it to get all possible combinations and transform back to numeric. Finally, merge them both achieving the expected output:

mask['key2'] = mask['key2'].fillna(' '.join([str(x) for x in data['key2'].unique()])).astype(str).str.split(' ')
mask = mask.explode('key2')
mask['key2'] = pd.to_numeric(mask['key2'])
pd.merge(mask,data,on=['key1','key2'],how='left')

Outputting:

   key1  key2  value2  value1
0     1     1       1       1
1     1     2       1       2
2     1     3       1       3
3     1     3       2       3
4     2     1       3       4
5     2     2       4       5
  • Related