Compare a Column value to different columns and return a value from same row but different column us-CodePudding

I got data as in image 1.

By using for loop in Python Pandas, I have splitted FG and PID columns as in image 2.

Based on the condition, I want the output as picture 3:

if Main_FG == FG_1, value will be from PID_1
if Main_FG == FG_2, value will be from PID_2
if Main_FG == FG_3, value will be from PID_3

I have tried using for loop but does not help.

What would be the effective way to do that in Python? Please note that, the number of FG_1, FG_xxxxx, varies therefore, I need code to be dynamic and I am unable to do by using np.select.

CodePudding user response：

My approach is to define an external function to select the relevant PID

import pandas as pd

df = pd.DataFrame({'Main_FG': ['AB', 'AC', 'AM', 'AP', 'AE', 'RE', 'rw'],
                   'FG': ['CD, AB,QF', 'AC', 'AM', 'AD, GH,AP', 'GH, ae, ME', 'RE', 'RW'],
                   'PID': ['A.900,B1.56,C1.650', 'R1', 'Q1.45', 'S1.76,S2,56,S3', 'J1.33,J2.39,S3', 'B1.09, B2.56, E2', 'S3.07']  })
print(df)

  Main_FG          FG                 PID
0      AB   CD, AB,QF  A.900,B1.56,C1.650
1      AC          AC                  R1
2      AM          AM               Q1.45
3      AP   AD, GH,AP      S1.76,S2,56,S3    #note the whitespaces between commas
4      AE  GH, ae, ME      J1.33,J2.39,S3    #note the lowercase
5      RE          RE    B1.09, B2.56, E2    #note the whitespaces between commas
6      rw          RW               S3.07    #note the lowercase

The function also takes care of 2 data cleaning steps:
(1) to compare based on lowercase,
(2) to split on comma but also to remove whitespaces

def select_pid(row):
    idx = [e.strip().lower() for e in row['FG'].split(',')].index(row['Main_FG'].lower())
    return [e.strip() for e in row['PID'].split(',')][idx]
    
df['SELECTED_PID'] = df.apply(lambda row: select_pid(row), axis=1)
print(df)

  Main_FG          FG                 PID SELECTED_PID
0      AB   CD, AB,QF  A.900,B1.56,C1.650        B1.56
1      AC          AC                  R1           R1
2      AM          AM               Q1.45        Q1.45
3      AP   AD, GH,AP      S1.76,S2,56,S3           56
4      AE  GH, ae, ME      J1.33,J2.39,S3        J2.39
5      RE          RE    B1.09, B2.56, E2        B1.09
6      rw          RW               S3.07        S3.07

CodePudding user response：

Using a dummy example as yours is not reproducible.

You can use a list comprehension to loop over the key/values as a dictionary:

df['selected_PID'] = [dict(zip(k,v)).get(ref)
                      for ref,k,v in zip(df['main_FG'].str.upper(),
                                         df['FG'].str.split(','),
                                         df['PID'].str.split(','))]

NB. If you want to drop the columns, use df.pop('FG').str.split(',') in place of df['FG'].str.split(',') (same for PID).

Alternative with explode:

(df.assign(FG=df['FG'].str.split(','),
          PID=df['PID'].str.split(',')
         )
   .explode(['FG', 'PID'])
   .loc[lambda x: x['main_FG'].eq(x['FG'])]
   .rename(columns={'PID': 'selected_PID'})
   .drop(columns='FG')
)

output:

  main_FG     FG    PID selected_PID
0       B  A,B,C  U,V,W            V
1       E    D,E    X,Y            Y
2       F      F      Z            Z

used input:

  main_FG     FG    PID
0       B  A,B,C  U,V,W
1       E    D,E    X,Y
2       F      F      Z

CodePudding user response：

You can use something like that:

from pandas import DataFrame, concat

df=DataFrame(data={'Main_FG': ['AB', 'AC', 'AM', 'AP', 'AE', 'RE', 'Rw'], 'FG_1': ['CD', 'AC', 'AM', 'AD', 'GH', 'RE', 'RW'], 'FG_2': ['AB', None, None, 'AP', 'AE', None, None], 'FG_3': ['QF', None, None, None, 'ME', None, None], 'PID_1': ['A.900', 'R1', 'Q1.45', 'S1.76', 'J1.33', 'B1.09', 'S3.07'], 'PID_2': ['B1.56', None, None, 'S2.56', 'J2.39', 'B2.56', None], 'PID_3': ['C1.650', None, None, 'S3', 'S3', 'E2', None]})
fg_cols = [col for col in df.columns if col.startswith('FG_')]
pid_cols = [col for col in df.columns if col.startswith('PID_')]
assert len(fg_cols) == len(pid_cols)
df_list = []
for i, fg_col in enumerate(fg_cols):
    df_list.append(df[df['Main_FG'] == df[fg_col]][['Main_FG', pid_cols[i]]].rename(columns={pid_cols[i]: 'SELECTED_PID'}))
result_df = concat(df_list)

  Main_FG SELECTED_PID
1      AC           R1
2      AM        Q1.45
5      RE        B1.09
0      AB        B1.56
3      AP        S2.56
4      AE        J2.39

You could sort the Main_FG column if you like.

CodePudding user response：

I am going from situation 2 to situation 3:

You can split your dataframe into two subdataframes. df1 containing FG_... and df2 containing PID_... columns. Create a mask using Main_FG and df1 and apply this on df2.

df = pd.DataFrame([
    ['AB', 'CD', 'AB', 'QF', 'A.900', 'B1.56', 'C1.650'],
    ['AC', 'AC', '', '', 'R1', '', ''],
    ['AM', 'AM', '', '', 'Q1.45', '', ''],
    ['AP', 'AD', 'GH', 'AP' , 'S1.76', 'S2.56', 'S3']],
    columns=['Main_FG', 'FG_1', 'FG_2', 'FG_3', 'PID_1', 'PID_2', 'PID_3'])

n = len(df) // 2   1
df1 = df.iloc[:, 1:n]
df2 = df.iloc[:, n:]

mask = df1.eq(df.Main_FG, axis='rows')
mask[mask == 0] = np.nan

df['SELECTED_PID'] = (df2.values * mask).bfill(axis=1).iloc[:, 0]
df3 = df[['Main_FG', 'SELECTED_PID']]

Output df3:

  Main_FG SELECTED_PID
0      AB        B1.56
1      AC           R1
2      AM        Q1.45
3      AP           S3

CodePudding user response：

It looks like problem described may be solved with a sequence of str.split

df['FG'].str.split(',')
df['PID'].str.split(',')

and explode method

df.explode(['FG', 'PID'])

Thus, you may arrive at the flattened dataframe