How to join multiple columns with multiple other columns (crosswise) in pandas and return the combinations as an array

So, here I want to take the first 3 characters (letters or digits only) from Column1, the first 4 characters (letters or digits only) from Column2, and the first 5 digits from each of Colum3a, Colum3b, Colum3c and Colum3d (whichever are present), and build an array from them as in the desired output column given below. Any special characters such as ., -, ' and spaces must be removed first, so that only alphabetical and numerical characters are kept. The output should be NaN if any one of columns 1, 2 or 3 is missing. (If both 1 and 2 are present and at least one of the column 3 values is present, the output should be produced.)

Column1     Column2     Colum3a     Colum3b     Colum3c     Colum3d    S.NO DESIRED OUTPUT 
ABCDE       QWERTY      12345678    1223456     234567      1234589     1      
T.BCDF      W ERTY      567890      NaN         NaN         NaN         2
ERTYUMH     TY-IOPU     9845366     5672341     NaN         NaN         3
NaN         ERTYUI      1986788     NaN         NaN         NaN         4
SA--RTYUNK  QWPOIJH     NaN         NaN         34564557    NaN         5
WQER        QWER        NaN         NaN         NaN         NaN         6
S'E         WERTER      12233412    NaN         NaN         5678908     7

Desired output column should be like:

DESIRED OUTPUT                                                        S.NO
["ABC|QWER|12345","ABC|QWER|12234","ABC|QWER|23456","ABC|QWER|12345"]   1
["TBC|WERT|56789"]                                                      2
["ERT|TYIO|98453","ERT|TYIO|56723"]                                     3
NaN                                                                     4
["SAR|QWPO|34564"]                                                      5
NaN                                                                     6
["SE|WERT|12233","SE|WERT|56789"]                                       7

Please help me with this. I am concatenating with the code below, but I don't know how to build the array from the column 3 values.

df1['column4'] = (df1['Column1'].str[:3] + '|' +
                  df1['Column2'].str[:4] + '|' +
                  df1['Column3'].astype(str).replace({'^nan$': None, r'\.0$': ''}, regex=True))

CodePudding user response：

Here is a solution as a pipeline that should work for an arbitrary number of columns. The only requirement is that the column names can be filtered (here using Column and Colum3 as patterns); otherwise, one needs to build a list of those columns and use classical slicing:

(df.filter(like='Column').apply(lambda c: c.str.replace(r'\W', '', regex=True).str[:3])
   .join(df['S.NO DESIRED OUTPUT'])
   .assign(Column3=df.filter(like='Colum3').apply(list, axis=1))
   .explode('Column3').dropna(subset=['Column3'])
   .assign(Column3=lambda d: d['Column3'].astype(str).str[:5])
   .set_index('S.NO DESIRED OUTPUT')
   .astype(str)  
   .apply('|'.join, axis=1)
   .groupby(level=0).apply(list)
   .rename('DESIRED OUTPUT')
   .mask(df.filter(like='Column').isna().any(axis=1))
   .reset_index()
)

output:

   S.NO DESIRED OUTPUT                                     DESIRED OUTPUT
0                    1  [ABC|QWE|12345, ABC|QWE|12234, ABC|QWE|23456, ...
1                    2                                    [TBC|WER|56789]
2                    3                                                NaN
3                    4                                    [nan|ERT|19867]
4                    5                                    [SAR|QWP|34564]
5                    7                                                NaN

CodePudding user response：

Use:

# clean Column1 and Column2, then join their first 3 and 4 characters
s = (df1['Column1'].str.replace(r'\W', '', regex=True).str[:3] + '|' +
     df1['Column2'].str.replace(r'\W', '', regex=True).str[:4] + '|')

# convert each remaining column to strings, drop 'nan' and a trailing '.0', and append to s
f = lambda x: s + x.astype(str).replace({'^nan$': None, r'\.0$': ''}, regex=True).str[:5]

# for the columns filtered by name ('Colum3'), collect the values into one list per row
df1 = (df1.filter(like='Colum3').apply(f)
          .stack()
          .groupby(level=0)
          .agg(list)
          .to_frame('new')
          .join(df1['S.NO'], how='right'))

print (df1)
                                                 new  S.NO
0  [ABC|QWER|12345, ABC|QWER|12234, ABC|QWER|2345...     1
1                                   [TBC|WERT|56789]     2
2                   [ERT|TYIO|98453, ERT|TYIO|56723]     3
3                                                NaN     4
4                                   [SAR|QWPO|34564]     5
5                                                NaN     6
6                     [SE|WERT|12233, SE|WERT|56789]     7
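
For comparison, the rules from the question can also be written out row by row in plain Python, which is slower than the vectorized answers but makes each condition explicit. This is only a minimal sketch: the helper names clean and build_keys are made up for illustration, the sample frame is retyped from the question's table, and it assumes that "special characters" means anything other than ASCII letters and digits (so an underscore is removed as well, unlike with \W).

```python
import re

import numpy as np
import pandas as pd


def clean(value, n):
    """Hypothetical helper: keep only ASCII letters/digits, return the first n characters."""
    if pd.isna(value):
        return None
    if isinstance(value, float) and value.is_integer():
        value = int(value)  # 567890.0 -> 567890, so no '.0' ends up in the text
    text = re.sub(r'[^0-9A-Za-z]', '', str(value))
    return text[:n] or None  # treat an empty result like a missing value


def build_keys(row, third_cols):
    """Hypothetical helper: list of 'col1|col2|col3x' strings, or NaN if a part is missing."""
    first = clean(row['Column1'], 3)
    second = clean(row['Column2'], 4)
    thirds = [clean(row[col], 5) for col in third_cols]
    thirds = [t for t in thirds if t is not None]
    if first is None or second is None or not thirds:
        return np.nan
    return [f'{first}|{second}|{t}' for t in thirds]


# Sample data retyped from the table in the question.
df = pd.DataFrame({
    'Column1': ['ABCDE', 'T.BCDF', 'ERTYUMH', np.nan, 'SA--RTYUNK', 'WQER', "S'E"],
    'Column2': ['QWERTY', 'W ERTY', 'TY-IOPU', 'ERTYUI', 'QWPOIJH', 'QWER', 'WERTER'],
    'Colum3a': [12345678, 567890, 9845366, 1986788, np.nan, np.nan, 12233412],
    'Colum3b': [1223456, np.nan, 5672341, np.nan, np.nan, np.nan, np.nan],
    'Colum3c': [234567, np.nan, np.nan, np.nan, 34564557, np.nan, np.nan],
    'Colum3d': [1234589, np.nan, np.nan, np.nan, np.nan, np.nan, 5678908],
    'S.NO': [1, 2, 3, 4, 5, 6, 7],
})

third_cols = [col for col in df.columns if col.startswith('Colum3')]
results = [build_keys(row, third_cols) for _, row in df.iterrows()]
# dtype=object keeps each list as a single cell value.
df['DESIRED OUTPUT'] = pd.Series(results, index=df.index, dtype=object)

print(df[['S.NO', 'DESIRED OUTPUT']])
```

Because the prefix lengths are passed to clean per column, using different lengths only means changing the 3, 4 and 5 in build_keys.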