I have a big dataframe with rows containing an ID within a string column. I want to reduce the rows in the dataframe based on the 0/1 matrix containing IDs as column names. To illustrate, I created a dataframe df1
:
import pandas as pd
df1 = pd.DataFrame({
'Date':['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', '2021-01-03', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-05', '2021-01-05', '2021-01-05'],
'ID':['R: ID1', 'ID2_P', 'L_ID2', 'ID1.I', 'RZ:ID3', 'ID1', 'P ID2', 'ID3 4K', 'ID1', 'U_ID1', 'ID2:PD', 'ID3=2D'],
'Ratings':[9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5, 4, 10, 3, 6]})
and a matrix m1
containing 0 and 1:
m1 = pd.DataFrame({
'Date':['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
'ID1':[0,0,1,1,1],
'ID2':[1,1,1,1,0],
'ID3':[0,0,0,0,1]})
Conditioned on m1
, the dataframe df1
needs to be reduced to df2
:
df2 = pd.DataFrame({
'Date':['2021-01-01', '2021-01-01', '2021-01-03', '2021-01-03', '2021-01-05', '2021-01-05', '2021-01-05'],
'ID':['ID2_P', 'L_ID2', 'ID1.I', 'P ID2', 'ID1', 'U_ID1', 'ID3=2D'],
'Ratings':[8.0, 5.0, 3, 6, 4, 10, 6]})
So far, I tried to iterate through the rows of df1
with a for loop
but I wasn't able to get a result.
Thanks a lot for your suggestions!
CodePudding user response:
Use DataFrame.melt
with filter 1
rows first:
df2 = m1.melt('Date', var_name='ID1').query('value == 1')
print (df2)
Date ID1 value
2 2021-01-03 ID1 1
3 2021-01-04 ID1 1
4 2021-01-05 ID1 1
5 2021-01-01 ID2 1
6 2021-01-02 ID2 1
7 2021-01-03 ID2 1
8 2021-01-04 ID2 1
14 2021-01-05 ID3 1
Then get ID
by df2["ID1"].unique()
in Series.str.extract
:
df1['ID1'] = df1['ID'].str.extract(f'({"|".join(df2["ID1"].unique())})', expand=False)
And last filter by Date
and ID1
in inner default join by DataFrame.merge
, remove helper ID1
column:
df = df1.merge(df2[['Date','ID1']]).drop('ID1', axis=1)
print (df)
Date ID Ratings
0 2021-01-01 ID2_P 8.0
1 2021-01-01 L_ID2 5.0
2 2021-01-03 ID1 3.0
3 2021-01-03 P ID2 6.0
4 2021-01-05 ID1 4.0
5 2021-01-05 U_ID1 10.0
6 2021-01-05 ID3=2D 6.0