pandas-if condition in data frame to get column name-CodePudding

I have a problem with my code and try to solve it by myself but it has taken a lot of time and still failed.

I have two data frames like this.

df_doc

doc_id subgroup_doc
12345    A01B13/42
67891    B01C10/257
23456    C02D7/18

and df_check

  subgroup_doc      main   sg_1   sg_2   sg_3    sg_4
   A01B13/42       13/00   13/04  13/42   NaN     NaN
   B01C10/257      10/00   10/02  10/25  10/257   NaN
   C02D7/18         7/00   7/01   7/18    NaN     NaN

I intend to get a result like this

    doc_id subgroup_doc    doc_group
    12345    A01B13/42       sg_2 
    67891    B01C10/257      sg_3
    23456    C02D7/18        sg_2

then I created a function

def checkdot_ipc_sg2_v2(raw_data,check_group):
    
    new_raw_data = raw_data[['doc_id','subgroup_doc']]
    merge_find_group = pd.merge(new_raw_data, check_group, how='left', on=['subgroup_doc'])

    t_doc = merge_find_group.subgroup_doc.str.split('/')
    s_doc_sg2 = merge_find_group.sg_2.str.split('/')
    s_doc_sg1 = merge_find_group.sg_1.str.split('/')
    s_doc_sg0 = merge_find_group.maingroup.str.split('/')
  
    raw_data['doc_checkgp'] = t_doc.str[0]  "/"   s_doc_sg2.str[1].str[:]
    raw_data['doc_group'] = 'sg_2'

   
    if(raw_data['doc_checkgp'].isna().any()):
        raw_data['doc_checkgp'].fillna(t_doc.str[0]  "/"   s_doc_sg1.str[1].str[:], inplace=True)
        raw_data['doc_group'] = 'sg_1'
        
        if(raw_data['doc_checkgp'].isna().any()):
            raw_data['doc_checkgp'].fillna(t_doc.str[0]  "/"   s_doc_sg0.str[1].str[:], inplace=True)
            raw_data['doc_group'] = 'sg_0'

but results are returned as

  doc_id subgroup_doc   doc_checkgp   doc_group
    12345    A01B13/42     A01B13/42      sg_0 
    67891    B01C10/257    B01C10/257     sg_0
    23456    C02D7/18      C02D7/18       sg_0

did I write something wrong? I think I did not make any mistake of if-condition.

CodePudding user response：

It is not very clear how combination of df_doc and df_check results in the outcome that you want.

But maybe you want something like this?

In [65]: df_doc.merge(df_check, on="subgroup_doc")
Out[65]:
   doc_id subgroup_doc   main   sg_1   sg_2    sg_3  sg_4
0   12345    A01B13/42  13/00  13/04  13/42     NaN   NaN
1   67891   B01C10/257  10/00  10/02  10/25  10/257   NaN
2   23456     C02D7/18   7/00   7/01   7/18     NaN   NaN

CodePudding user response：

Do you want to get column names of df_check that have an entry? Still not sure what is your goal. But here is something I guess you wanted.

In [77]: df_doc.merge(df_check, on="subgroup_doc").set_index(["doc_id", "subgro
    ...: up_doc"]).stack().reset_index().drop(0,axis=1)
Out[77]:
   doc_id subgroup_doc level_2
0   12345    A01B13/42    main
1   12345    A01B13/42    sg_1
2   12345    A01B13/42    sg_2
3   67891   B01C10/257    main
4   67891   B01C10/257    sg_1
5   67891   B01C10/257    sg_2
6   67891   B01C10/257    sg_3
7   23456     C02D7/18    main
8   23456     C02D7/18    sg_1
9   23456     C02D7/18    sg_2

CodePudding user response：

You can use:

dfx = df_doc.merge(df_check,on='subgroup_doc')
dfx['key_1'] = dfx['subgroup_doc'].str.split('/').str[1]
'''
    key_1
0   42
1   257
2   18

'''

For each row, let's split the values of the columns in the list named cols using split and return the column name if the value after / matches the value of key_1.

cols = ['main','sg_1','sg_2','sg_3','sg_4']
dfx['check'] = dfx.apply(lambda x: [i if pd.notna(x[i]) and x['key_1'] == x[i].split('/')[1] else False for i in cols],axis=1)
'''
|    |   doc_id | subgroup_doc   | main   | sg_1   | sg_2   | sg_3   |   sg_4 |   key_1 | check                                |
|---:|---------:|:---------------|:-------|:-------|:-------|:-------|-------:|--------:|:-------------------------------------|
|  0 |    12345 | A01B13/42      | 13/00  | 13/04  | 13/42  | nan    |    nan |      42 | [False, False, 'sg_2', False, False] |
|  1 |    67891 | B01C10/257     | 10/00  | 10/02  | 10/25  | 10/257 |    nan |     257 | [False, False, False, 'sg_3', False] |
|  2 |    23456 | C02D7/18       | 7/00   | 7/01   | 7/18   | nan    |    nan |      18 | [False, False, 'sg_2', False, False] |
'''

let's drop the false elements:

dfx['check'] = dfx['check'].apply(lambda x: [i for i in x if i]).explode()

Output::

dfx=dfx[['doc_id','subgroup_doc','check']]

|    |   doc_id | subgroup_doc   | check   |
|---:|---------:|:---------------|:--------|
|  0 |    12345 | A01B13/42      | sg_2    |
|  1 |    67891 | B01C10/257     | sg_3    |
|  2 |    23456 | C02D7/18       | sg_2    |