Home > database >  Check if columns representing a range contain values of a second dataframe
Check if columns representing a range contain values of a second dataframe

Time:03-30

I have two dataframes that look like this:

data = [['id', 'r0', 'r1'],
        ['123',   1, 6],
        ['436',   9, 26],
        ['791',  45, 200]]

df = pd.DataFrame(data)


data_2 = [['id', 'r3',   'name'],
        ['123',     1, 'name_a'],
        ['436',    12, 'name_f'],
        ['436',    78, 'name_q'],
        ['234',   300, 'name_d']]

df_2 = pd.DataFrame(data_2)

I want to include two extra columns on df that contain the following information:

  • is_contained: True/False. It tells if there is any value if df_2[r3] contained in the range df[r0-r1] where df[id] == df_2[id]. For instance, in the first two registers of df, it will be True and in the third one, it will be False.
  • name: if is_contained is True, it copies the value of df_2["name"]. If not, this string will be empty.

Therefore, the result for this simple example would be:

df:

     0   1    2             3        4
0   id  r0   r1  is_contained     name
1  123   1    6          True   name_a
2  456   9   26          True   name_f
3  791  45  200         False       

Taking into account than in my real example this computation will include a quite large df, which would be an efficient way to solve this problem?

CodePudding user response:

If I understood correctly (and assuming the first row as column names!), you can use pandas.merge_asof, then compute the "is_contained" column and update the "name" if it doesn't match the upper bound requirement.

Pre-requisite, ensure that the dataframes are sorted by 'r0' and 'r3'.

(pd
 .merge_asof(df, df_2, by='id', left_on='r0', right_on='r3', direction='forward')
 .assign(is_contained=lambda d: d['r3'].le(d['r1']),
         name=lambda d: d['name'].where(d['is_contained'])
        )
 #.drop(columns='r3') # optional, to remove merge column
)

Output:

    id  r0   r1    r3    name  is_contained
0  123   1    6   1.0  name_a          True
1  436   9   26  12.0  name_f          True
2  791  45  200   NaN     NaN         False

Used input:

data = [['123',   1, 6],
        ['436',   9, 26],
        ['791',  45, 200]]

df = pd.DataFrame(data, columns=['id', 'r0', 'r1'])


data_2 = [['123',     1, 'name_a'],
        ['436',    12, 'name_f'],
        ['436',    78, 'name_q'],
        ['234',   300, 'name_d']]

df_2 = pd.DataFrame(data_2, columns=['id', 'r3',   'name'])

CodePudding user response:

This did the trick

df.to_sql("df", conn, index=False)
df_2.to_sql("df_2", conn, index=False)
query = "SELECT *"
    "FROM df "\
    "LEFT JOIN df_2 "\
    "ON (df.id = df_2.id AND "\
    "df_2.r3 >= df.r0 AND "\
    "df_2.r3 <= df.r1);"

df = pd.read_sql_query(query,conn)
  • Related