I have a problem with my code and try to solve it by myself but it has taken a lot of time and still failed.
I have two data frames like this.
df_doc
doc_id subgroup_doc
12345 A01B13/42
67891 B01C10/257
23456 C02D7/18
and df_check
subgroup_doc main sg_1 sg_2 sg_3 sg_4
A01B13/42 13/00 13/04 13/42 NaN NaN
B01C10/257 10/00 10/02 10/25 10/257 NaN
C02D7/18 7/00 7/01 7/18 NaN NaN
I intend to get a result like this
doc_id subgroup_doc doc_group
12345 A01B13/42 sg_2
67891 B01C10/257 sg_3
23456 C02D7/18 sg_2
then I created a function
def checkdot_ipc_sg2_v2(raw_data,check_group):
new_raw_data = raw_data[['doc_id','subgroup_doc']]
merge_find_group = pd.merge(new_raw_data, check_group, how='left', on=['subgroup_doc'])
t_doc = merge_find_group.subgroup_doc.str.split('/')
s_doc_sg2 = merge_find_group.sg_2.str.split('/')
s_doc_sg1 = merge_find_group.sg_1.str.split('/')
s_doc_sg0 = merge_find_group.maingroup.str.split('/')
raw_data['doc_checkgp'] = t_doc.str[0] "/" s_doc_sg2.str[1].str[:]
raw_data['doc_group'] = 'sg_2'
if(raw_data['doc_checkgp'].isna().any()):
raw_data['doc_checkgp'].fillna(t_doc.str[0] "/" s_doc_sg1.str[1].str[:], inplace=True)
raw_data['doc_group'] = 'sg_1'
if(raw_data['doc_checkgp'].isna().any()):
raw_data['doc_checkgp'].fillna(t_doc.str[0] "/" s_doc_sg0.str[1].str[:], inplace=True)
raw_data['doc_group'] = 'sg_0'
but results are returned as
doc_id subgroup_doc doc_checkgp doc_group
12345 A01B13/42 A01B13/42 sg_0
67891 B01C10/257 B01C10/257 sg_0
23456 C02D7/18 C02D7/18 sg_0
did I write something wrong? I think I did not make any mistake of if-condition.
CodePudding user response:
It is not very clear how combination of df_doc and df_check results in the outcome that you want.
But maybe you want something like this?
In [65]: df_doc.merge(df_check, on="subgroup_doc")
Out[65]:
doc_id subgroup_doc main sg_1 sg_2 sg_3 sg_4
0 12345 A01B13/42 13/00 13/04 13/42 NaN NaN
1 67891 B01C10/257 10/00 10/02 10/25 10/257 NaN
2 23456 C02D7/18 7/00 7/01 7/18 NaN NaN
CodePudding user response:
Do you want to get column names of df_check that have an entry? Still not sure what is your goal. But here is something I guess you wanted.
In [77]: df_doc.merge(df_check, on="subgroup_doc").set_index(["doc_id", "subgro
...: up_doc"]).stack().reset_index().drop(0,axis=1)
Out[77]:
doc_id subgroup_doc level_2
0 12345 A01B13/42 main
1 12345 A01B13/42 sg_1
2 12345 A01B13/42 sg_2
3 67891 B01C10/257 main
4 67891 B01C10/257 sg_1
5 67891 B01C10/257 sg_2
6 67891 B01C10/257 sg_3
7 23456 C02D7/18 main
8 23456 C02D7/18 sg_1
9 23456 C02D7/18 sg_2
CodePudding user response:
You can use:
dfx = df_doc.merge(df_check,on='subgroup_doc')
dfx['key_1'] = dfx['subgroup_doc'].str.split('/').str[1]
'''
key_1
0 42
1 257
2 18
'''
For each row, let's split the values of the columns in the list named cols using split and return the column name if the value after / matches the value of key_1.
cols = ['main','sg_1','sg_2','sg_3','sg_4']
dfx['check'] = dfx.apply(lambda x: [i if pd.notna(x[i]) and x['key_1'] == x[i].split('/')[1] else False for i in cols],axis=1)
'''
| | doc_id | subgroup_doc | main | sg_1 | sg_2 | sg_3 | sg_4 | key_1 | check |
|---:|---------:|:---------------|:-------|:-------|:-------|:-------|-------:|--------:|:-------------------------------------|
| 0 | 12345 | A01B13/42 | 13/00 | 13/04 | 13/42 | nan | nan | 42 | [False, False, 'sg_2', False, False] |
| 1 | 67891 | B01C10/257 | 10/00 | 10/02 | 10/25 | 10/257 | nan | 257 | [False, False, False, 'sg_3', False] |
| 2 | 23456 | C02D7/18 | 7/00 | 7/01 | 7/18 | nan | nan | 18 | [False, False, 'sg_2', False, False] |
'''
let's drop the false elements:
dfx['check'] = dfx['check'].apply(lambda x: [i for i in x if i]).explode()
Output::
dfx=dfx[['doc_id','subgroup_doc','check']]
| | doc_id | subgroup_doc | check |
|---:|---------:|:---------------|:--------|
| 0 | 12345 | A01B13/42 | sg_2 |
| 1 | 67891 | B01C10/257 | sg_3 |
| 2 | 23456 | C02D7/18 | sg_2 |