I want to take the first 3 characters (letters or digits only) from Column1, the first 4 characters (letters or digits only) from Column2, and the first 5 digits from each of Colum3a, Colum3b, Colum3c and Colum3d (whichever are present), and build an array of them as in the DESIRED OUTPUT column shown further down. Special characters such as ., -, ' and spaces must be removed first, so that only alphabetic and numeric characters are kept. The output should be NaN if Column1, Column2, or all of the Colum3 columns are missing. (If Column1 and Column2 are both present and at least one of the Colum3 columns is present, an output should be produced.)
Column1 Column2 Colum3a Colum3b Colum3c Colum3d S.NO DESIRED OUTPUT
ABCDE QWERTY 12345678 1223456 234567 1234589 1
T.BCDF W ERTY 567890 NaN NaN NaN 2
ERTYUMH TY-IOPU 9845366 5672341 NaN NaN 3
NaN ERTYUI 1986788 NaN NaN NaN 4
SA--RTYUNK QWPOIJH NaN NaN 34564557 NaN 5
WQER QWER NaN NaN NaN NaN 6
S'E WERTER 12233412 NaN NaN 5678908 7
The desired output column should look like this:
DESIRED OUTPUT S.NO
["ABC|QWER|12345","ABC|QWER|12234","ABC|QWER|23456","ABC|QWER|12345"] 1
[TBC|WER|56789] 2
["ERT|TYIO|98453","ERT|TYIO|56723"] 3
NaN 4
[SAR|QWPO|34564] 5
NaN 6
["SE|WERT|12233","SE|WERT|56789"] 7
Please help me with this. I am concatenating with the code below, but I don't know how to build the array from the Colum3 columns.
df1['column4'] = (df1['Column1'].str[:3] + '|' +
                  df1['Column2'].str[:4] + '|' +
                  df1['Column3'].astype(str).replace({r'^nan$': None, r'\.0$': ''}, regex=True))
CodePudding user response:
Here is a solution as a pipeline that should work for an arbitrary number of columns. The only requirement is that the column names can be filtered by pattern (here using Column and Colum3); otherwise one needs to build a list of those columns and use classical slicing:
(df.filter(like='Column').apply(lambda c: c.str.replace(r'\W', '', regex=True).str[:3])
.join(df['S.NO DESIRED OUTPUT'])
.assign(Column3=df.filter(like='Colum3').apply(list, axis=1))
.explode('Column3').dropna(subset=['Column3'])
.assign(Column3=lambda d: d['Column3'].astype(str).str[:5])
.set_index('S.NO DESIRED OUTPUT')
.astype(str)
.apply('|'.join, axis=1)
.groupby(level=0).apply(list)
.rename('DESIRED OUTPUT')
.mask(df.filter(like='Column').isna().any(axis=1))
.reset_index()
)
output:
S.NO DESIRED OUTPUT DESIRED OUTPUT
0 1 [ABC|QWE|12345, ABC|QWE|12234, ABC|QWE|23456, ...
1 2 [TBC|WER|56789]
2 3 NaN
3 4 [nan|ERT|19867]
4 5 [SAR|QWP|34564]
5 7 NaN
CodePudding user response:
Use:
# strip non-word characters, then join the first 3 and first 4 characters of Column1 and Column2
s = (df1['Column1'].str.replace(r'\W', '', regex=True).str[:3] + '|' +
     df1['Column2'].str.replace(r'\W', '', regex=True).str[:4] + '|')
# convert every other column to strings, drop 'nan' and a trailing '.0', keep 5 characters and append to s
f = lambda x: s + x.astype(str).replace({r'^nan$': None, r'\.0$': ''}, regex=True).str[:5]
#for column filtered by name ('Colum3') add values to list
df1 = (df1.filter(like='Colum3').apply(f)
.stack()
.groupby(level=0)
.agg(list)
.to_frame('new')
.join(df1['S.NO'], how='right'))
print (df1)
new S.NO
0 [ABC|QWER|12345, ABC|QWER|12234, ABC|QWER|2345... 1
1 [TBC|WERT|56789] 2
2 [ERT|TYIO|98453, ERT|TYIO|56723] 3
3 NaN 4
4 [SAR|QWPO|34564] 5
5 NaN 6
6 [SE|WERT|12233, SE|WERT|56789] 7
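For comparison, the rules from the question can also be written out per row in plain Python, which makes the NaN conditions explicit. This is only a minimal sketch: the helper names clean and build_keys are hypothetical (they do not come from the question or the answers), and it assumes that missing cells arrive as None or float NaN and that "special characters" means anything outside A-Z, a-z and 0-9.

```python
import math
import re


def clean(value, n):
    """Return the first n alphanumeric characters of value, or None if it is missing."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None
        # A numeric column that contains NaN holds floats, so 567890 shows up as 567890.0.
        if value.is_integer():
            value = int(value)
    # Assumption: only ASCII letters and digits are kept (underscores are dropped too).
    text = re.sub(r'[^A-Za-z0-9]', '', str(value))
    return text[:n] or None


def build_keys(col1, col2, col3_values):
    """Build the list of 'AAA|BBBB|CCCCC' keys for one row, or NaN if a part is missing."""
    first, second = clean(col1, 3), clean(col2, 4)
    thirds = [c for c in (clean(v, 5) for v in col3_values) if c is not None]
    if first is None or second is None or not thirds:
        return float('nan')
    return [f'{first}|{second}|{c}' for c in thirds]


nan = float('nan')
print(build_keys('ABCDE', 'QWERTY', [12345678, 1223456, 234567, 1234589]))
print(build_keys("S'E", 'WERTER', [12233412.0, nan, nan, 5678908.0]))
print(build_keys(nan, 'ERTYUI', [1986788.0, nan, nan, nan]))
```

With a DataFrame, something like df1.apply(lambda r: build_keys(r['Column1'], r['Column2'], r.filter(like='Colum3')), axis=1) should apply it row by row. That is slower than the vectorised answers above, but easier to adapt if the rules change.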