I have two dataframes that look like this:
data = [['id', 'r0', 'r1'],
['123', 1, 6],
['436', 9, 26],
['791', 45, 200]]
df = pd.DataFrame(data)
data_2 = [['id', 'r3', 'name'],
['123', 1, 'name_a'],
['436', 12, 'name_f'],
['436', 78, 'name_q'],
['234', 300, 'name_d']]
df_2 = pd.DataFrame(data_2)
I want to include two extra columns on df that contain the following information:
is_contained
: True/False. It tells if there is any value ifdf_2[r3
] contained in the rangedf[r0-r1]
wheredf[id] == df_2[id]
. For instance, in the first two registers ofdf
, it will be True and in the third one, it will be False.name
: if is_contained is True, it copies the value ofdf_2["name"]
. If not, this string will be empty.
Therefore, the result for this simple example would be:
df
:
0 1 2 3 4
0 id r0 r1 is_contained name
1 123 1 6 True name_a
2 456 9 26 True name_f
3 791 45 200 False
Taking into account than in my real example this computation will include a quite large df
, which would be an efficient way to solve this problem?
CodePudding user response:
If I understood correctly (and assuming the first row as column names!), you can use pandas.merge_asof
, then compute the "is_contained" column and update the "name" if it doesn't match the upper bound requirement.
Pre-requisite, ensure that the dataframes are sorted by 'r0' and 'r3'.
(pd
.merge_asof(df, df_2, by='id', left_on='r0', right_on='r3', direction='forward')
.assign(is_contained=lambda d: d['r3'].le(d['r1']),
name=lambda d: d['name'].where(d['is_contained'])
)
#.drop(columns='r3') # optional, to remove merge column
)
Output:
id r0 r1 r3 name is_contained
0 123 1 6 1.0 name_a True
1 436 9 26 12.0 name_f True
2 791 45 200 NaN NaN False
Used input:
data = [['123', 1, 6],
['436', 9, 26],
['791', 45, 200]]
df = pd.DataFrame(data, columns=['id', 'r0', 'r1'])
data_2 = [['123', 1, 'name_a'],
['436', 12, 'name_f'],
['436', 78, 'name_q'],
['234', 300, 'name_d']]
df_2 = pd.DataFrame(data_2, columns=['id', 'r3', 'name'])
CodePudding user response:
This did the trick
df.to_sql("df", conn, index=False)
df_2.to_sql("df_2", conn, index=False)
query = "SELECT *"
"FROM df "\
"LEFT JOIN df_2 "\
"ON (df.id = df_2.id AND "\
"df_2.r3 >= df.r0 AND "\
"df_2.r3 <= df.r1);"
df = pd.read_sql_query(query,conn)