How to efficiently split text in a list using Python, as in the example below?

Time:12-07

I have a list something like this:

['Security Name % to Net Assets* Assets* DEBENtURES 0.04 Britannia Industries Ltd. 0.04 Britannia Industries Ltd. 0.04 EQUity & EQUity RELAtED 96.83 EQUity & EQUity RELAtED 96.83 HDFC Ban
k Ltd. 6.98 HDFC Bank Ltd. 6.98 ICICI Bank Ltd. 4.82 ICICI Bank Ltd. 4.82 Infosys Ltd. 4.37 Infosys Ltd. 4.37 Reliance Industries Ltd. 4.05 Reliance Industries Ltd. 4.05 Bajaj\tFinance\tL
td. 3.82 Bajaj\tFinance\tLtd. 3.82 Housing Development Finance Corpn. Ltd. 3.23 Housing Development Finance Corpn. Ltd. 3.23 Grindwell Norton Ltd. 3.22 SRF Ltd. 3.22 SRF Ltd. 3.22 Sun Pha
rmaceutical Industries Ltd. 2.85 Sun Pharmaceutical Industries Ltd. 2.85 Bharti Airtel Ltd. 2.82 DLF Ltd. 2.64 DLF Ltd. 2.64 Ultratech Cement Ltd. 2.62 Ultratech Cement Ltd. 2.62 SKF Indi
a Ltd. 2.45 Crompton Greaves Consumer Electricals Ltd. 2.42 Crompton Greaves Consumer Electricals Ltd. 2.42 Avenue Supermarts Ltd. 2.41 Avenue Supermarts Ltd. 2.41 Axis Bank Ltd. 2.41 ABB
 India Ltd. 2.35 ABB India Ltd. 2.35 Titan Co. Ltd. 2.29 Titan Co. Ltd. 2.29 Kotak Mahindra Bank Ltd. 2.09 Cipla Ltd. 2.05 Cipla Ltd. 2.05 Laurus Labs Ltd. 2.04 Laurus Labs Ltd. 2.04 Wipr
o Ltd. 1.77 Happiest Minds Technologies Ltd. 1.68 Happiest Minds Technologies Ltd. 1.68 Canara Bank 1.67 Canara Bank 1.67 Shree Cement Ltd. 1.63 Security Name % to Net Assets* Assets* Mah
indra & Mahindra Ltd. 1.59 Pidilite Industries Ltd. 1.50 Pidilite Industries Ltd. 1.50 ICICI Lombard General Insurance Co. Ltd. 1.48 ICICI Lombard General Insurance Co. Ltd. 1.48 Cholaman
dalam Investment & Finance Co. Ltd. 1.45 Cholamandalam Investment & Finance Co. Ltd. 1.45 Tech Mahindra Ltd. 1.35 Tech Mahindra Ltd. 1.35 State Bank of India 1.31 State Bank of India 1.31
 Hindustan Unilever Ltd. 1.30 Hindustan Unilever Ltd. 1.30 Vardhman Textiles Ltd. 1.30 Vardhman Textiles Ltd. 1.30 Larsen & Toubro Ltd. 1.24 Larsen & Toubro Ltd. 1.24 Dabur India Ltd. 1.2
2 Neogen Chemicals Ltd. 1.10 Neogen Chemicals Ltd. 1.10 Eicher Motors Ltd. 1.09 Eicher Motors Ltd. 1.09 Thermax Ltd. 1.08 TATA Consultancy Services Ltd. 1.05 TATA Consultancy Services Ltd
. 1.05 Indian Railway Catering & Tourism Corpn. Ltd. 0.98 Indian Railway Catering & Tourism Corpn. Ltd. 0.98 Firstsource Solutions Ltd. 0.97 Nestle India Ltd. 0.86 Nestle India Ltd. 0.86
Asian Paints Ltd. 0.84 Asian Paints Ltd. 0.84 Welspun India Ltd. 0.72 IndusInd Bank Ltd. 0.63 IndusInd Bank Ltd. 0.63 SBI Life Insurance Co. Ltd. 0.50 SBI Life Insurance Co. Ltd. 0.50 Dee
pak Nitrite Ltd. 0.46 Adani Ports and Special Economic Zone Ltd. 0.36 Adani Ports and Special Economic Zone Ltd. 0.36 Gateway Distriparks Ltd. 0.33 Gateway Distriparks Ltd. 0.33 Bharat Fo
rge Ltd. 0.22 tREPS on G-Sec or t-Bills 2.81 tREPS on G-Sec or t-Bills 2.81 Cash & Cash Receivables 0.32 Cash & Cash Receivables 0.32 tOtAL']

I am trying to add a comma after every entry in this list, so that I can then convert the list into a pandas DataFrame. I tried this:

transformed_text = [",".join(i.split(" ")) for i in text]
print(transformed_text)

which gave me this:

['Security,Name,%,to,Net,Assets*,Assets*,DEBENtURES,0.04,Britannia,Industries,Ltd.,0.04,Britannia,Industries,Ltd.,0.04,EQUity,&,EQUity,RELAtED,96.83,EQUity,&,EQUity,RELAtED,96.83,HDFC,Ban
k,Ltd.,6.98,HDFC,Bank,Ltd.,6.98,ICICI,Bank,Ltd.,4.82,ICICI,Bank,Ltd.,4.82,Infosys,Ltd.,4.37,Infosys,Ltd.,4.37,Reliance,Industries,Ltd.,4.05,Reliance,Industries,Ltd.,4.05,Bajaj\tFinance\tL
td.,3.82,Bajaj\tFinance\tLtd.,3.82,Housing,Development,Finance,Corpn.,Ltd.,3.23,Housing,Development,Finance,Corpn.,Ltd.,3.23,Grindwell,Norton,Ltd.,3.22,SRF,Ltd.,3.22,SRF,Ltd.,3.22,Sun,Pha
rmaceutical,Industries,Ltd.,2.85,Sun,Pharmaceutical,Industries,Ltd.,2.85,Bharti,Airtel,Ltd.,2.82,DLF,Ltd.,2.64,DLF,Ltd.,2.64,Ultratech,Cement,Ltd.,2.62,Ultratech,Cement,Ltd.,2.62,SKF,Indi
a,Ltd.,2.45,Crompton,Greaves,Consumer,Electricals,Ltd.,2.42,Crompton,Greaves,Consumer,Electricals,Ltd.,2.42,Avenue,Supermarts,Ltd.,2.41,Avenue,Supermarts,Ltd.,2.41,Axis,Bank,Ltd.,2.41,ABB
,India,Ltd.,2.35,ABB,India,Ltd.,2.35,Titan,Co.,Ltd.,2.29,Titan,Co.,Ltd.,2.29,Kotak,Mahindra,Bank,Ltd.,2.09,Cipla,Ltd.,2.05,Cipla,Ltd.,2.05,Laurus,Labs,Ltd.,2.04,Laurus,Labs,Ltd.,2.04,Wipr
o,Ltd.,1.77,Happiest,Minds,Technologies,Ltd.,1.68,Happiest,Minds,Technologies,Ltd.,1.68,Canara,Bank,1.67,Canara,Bank,1.67,Shree,Cement,Ltd.,1.63,Security,Name,%,to,Net,Assets*,Assets*,Mah
indra,&,Mahindra,Ltd.,1.59,Pidilite,Industries,Ltd.,1.50,Pidilite,Industries,Ltd.,1.50,ICICI,Lombard,General,Insurance,Co.,Ltd.,1.48,ICICI,Lombard,General,Insurance,Co.,Ltd.,1.48,Cholaman
dalam,Investment,&,Finance,Co.,Ltd.,1.45,Cholamandalam,Investment,&,Finance,Co.,Ltd.,1.45,Tech,Mahindra,Ltd.,1.35,Tech,Mahindra,Ltd.,1.35,State,Bank,of,India,1.31,State,Bank,of,India,1.31
,Hindustan,Unilever,Ltd.,1.30,Hindustan,Unilever,Ltd.,1.30,Vardhman,Textiles,Ltd.,1.30,Vardhman,Textiles,Ltd.,1.30,Larsen,&,Toubro,Ltd.,1.24,Larsen,&,Toubro,Ltd.,1.24,Dabur,India,Ltd.,1.2
2,Neogen,Chemicals,Ltd.,1.10,Neogen,Chemicals,Ltd.,1.10,Eicher,Motors,Ltd.,1.09,Eicher,Motors,Ltd.,1.09,Thermax,Ltd.,1.08,TATA,Consultancy,Services,Ltd.,1.05,TATA,Consultancy,Services,Ltd
.,1.05,Indian,Railway,Catering,&,Tourism,Corpn.,Ltd.,0.98,Indian,Railway,Catering,&,Tourism,Corpn.,Ltd.,0.98,Firstsource,Solutions,Ltd.,0.97,Nestle,India,Ltd.,0.86,Nestle,India,Ltd.,0.86,
Asian,Paints,Ltd.,0.84,Asian,Paints,Ltd.,0.84,Welspun,India,Ltd.,0.72,IndusInd,Bank,Ltd.,0.63,IndusInd,Bank,Ltd.,0.63,SBI,Life,Insurance,Co.,Ltd.,0.50,SBI,Life,Insurance,Co.,Ltd.,0.50,Dee
pak,Nitrite,Ltd.,0.46,Adani,Ports,and,Special,Economic,Zone,Ltd.,0.36,Adani,Ports,and,Special,Economic,Zone,Ltd.,0.36,Gateway,Distriparks,Ltd.,0.33,Gateway,Distriparks,Ltd.,0.33,Bharat,Fo
rge,Ltd.,0.22,tREPS,on,G-Sec,or,t-Bills,2.81,tREPS,on,G-Sec,or,t-Bills,2.81,Cash,&,Cash,Receivables,0.32,Cash,&,Cash,Receivables,0.32,tOtAL']

However, this loses a lot of relevant structure. For example, where I want Britannia Industries Ltd. I get Britannia,Industries,Ltd. There are more cases like this if you look closely, which is not what I want. I want the entries separated in a list like this:

 [Britannia Industries Ltd, 0.04, Asian Paints Ltd., 0.84 etc]

How can I make this work? Please help.

EDIT: I have fixed the data at the source, and the list I now get is this:

['Security Name % to Net Assets* DEBENtURES 0.04 Britannia Industries Ltd. EQUity & RELAtED 96.83 HDFC Bank 6.98 ICICI 4.82 Infosys 4.37 Reliance 4.05 Bajaj Finance 3.82 Housing Developme
nt Corpn. 3.23 Grindwell Norton 3.22 SRF Sun Pharmaceutical 2.85 Bharti Airtel 2.82 DLF 2.64 Ultratech Cement 2.62 SKF India 2.45 Crompton Greaves Consumer Electricals 2.42 Avenue Superma
rts 2.41 Axis ABB 2.35 Titan Co. 2.29 Kotak Mahindra 2.09 Cipla 2.05 Laurus Labs 2.04 Wipro 1.77 Happiest Minds Technologies 1.68 Canara 1.67 Shree 1.63 1.59 Pidilite 1.50 Lombard General
 Insurance 1.48 Cholamandalam Investment 1.45 Tech 1.35 State of 1.31 Hindustan Unilever 1.30 Vardhman Textiles Larsen Toubro 1.24 Dabur 1.22 Neogen Chemicals 1.10 Eicher Motors 1.09 Ther
max 1.08 TATA Consultancy Services 1.05 Indian Railway Catering Tourism 0.98 Firstsource Solutions 0.97 Nestle 0.86 Asian Paints 0.84 Welspun 0.72 IndusInd 0.63 SBI Life 0.50 Deepak Nitri
te 0.46 Adani Ports and Special Economic Zone 0.36 Gateway Distriparks 0.33 Bharat Forge 0.22 tREPS on G-Sec or t-Bills 2.81 Cash Receivables 0.32 tOtAL']

CodePudding user response:

You might be able to use re.sub here to insert a comma after every number:

import re
inp = '''Security Name % to Net Assets* DEBENtURES 0.04 Britannia Industries Ltd. EQUity & RELAtED 96.83 HDFC Bank 6.98 ICICI 4.82 Infosys 4.37 Reliance 4.05 Bajaj Finance 3.82 Housing Developme
nt Corpn. 3.23 Grindwell Norton 3.22 SRF Sun Pharmaceutical 2.85 Bharti Airtel 2.82 DLF 2.64 Ultratech Cement 2.62 SKF India 2.45 Crompton Greaves Consumer Electricals 2.42 Avenue Superma
rts 2.41 Axis ABB 2.35 Titan Co. 2.29 Kotak Mahindra 2.09 Cipla 2.05 Laurus Labs 2.04 Wipro 1.77 Happiest Minds Technologies 1.68 Canara 1.67 Shree 1.63 1.59 Pidilite 1.50 Lombard General
Insurance 1.48 Cholamandalam Investment 1.45 Tech 1.35 State of 1.31 Hindustan Unilever 1.30 Vardhman Textiles Larsen Toubro 1.24 Dabur 1.22 Neogen Chemicals 1.10 Eicher Motors 1.09 Ther
max 1.08 TATA Consultancy Services 1.05 Indian Railway Catering Tourism 0.98 Firstsource Solutions 0.97 Nestle 0.86 Asian Paints 0.84 Welspun 0.72 IndusInd 0.63 SBI Life 0.50 Deepak Nitri
te 0.46 Adani Ports and Special Economic Zone 0.36 Gateway Distriparks 0.33 Bharat Forge 0.22 tREPS on G-Sec or t-Bills 2.81 Cash Receivables 0.32 tOtAL
'''
# Append a comma to every integer or decimal number, e.g. "0.04" -> "0.04,"
output = re.sub(r'\b(\d+(?:\.\d+)?)\b', r'\1,', inp)
print(output)

This prints:

Security Name % to Net Assets* DEBENtURES 0.04, Britannia Industries Ltd. EQUity & RELAtED 96.83, HDFC Bank 6.98, ICICI 4.82, Infosys 4.37, Reliance 4.05, Bajaj Finance 3.82, Housing Developme nt Corpn. 3.23, Grindwell Norton 3.22, SRF Sun Pharmaceutical 2.85, Bharti Airtel 2.82, DLF 2.64, Ultratech Cement 2.62, SKF India 2.45, Crompton Greaves Consumer Electricals 2.42, Avenue Superma rts 2.41, Axis ABB 2.35, Titan Co. 2.29, Kotak Mahindra 2.09, Cipla 2.05, Laurus Labs 2.04, Wipro 1.77, Happiest Minds Technologies 1.68, Canara 1.67, Shree 1.63, 1.59, Pidilite 1.50, Lombard General Insurance 1.48, Cholamandalam Investment 1.45, Tech 1.35, State of 1.31, Hindustan Unilever 1.30, Vardhman Textiles Larsen Toubro 1.24, Dabur 1.22, Neogen Chemicals 1.10, Eicher Motors 1.09, Ther max 1.08, TATA Consultancy Services 1.05, Indian Railway Catering Tourism 0.98, Firstsource Solutions 0.97, Nestle 0.86, Asian Paints 0.84, Welspun 0.72, IndusInd 0.63, SBI Life 0.50, Deepak Nitri te 0.46, Adani Ports and Special Economic Zone 0.36, Gateway Distriparks 0.33, Bharat Forge 0.22, tREPS on G-Sec or t-Bills 2.81, Cash Receivables 0.32, tOtAL
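If the goal is a list such as [name, value, name, value, ...] rather than a comma-annotated string, one possible alternative is to split on the numbers themselves. The following is only a sketch: the sample string is a shortened stand-in for the real data, the helper names split_names_and_values and to_rows are made up for illustration, and it assumes every value is a plain integer or decimal surrounded by whitespace.

```python
import re

# Shortened sample in the same shape as the question's data (an assumption:
# each holding is a name followed by a whitespace-delimited number).
text = "DEBENtURES 0.04 HDFC Bank 6.98 ICICI 4.82 Asian Paints 0.84 tOtAL"

NUMBER = r'\d+(?:\.\d+)?'
# The capture group makes re.split keep the numbers in the result.
SPLITTER = re.compile(r'\s+(' + NUMBER + r')(?=\s|$)')


def split_names_and_values(s):
    """Return a flat list alternating between names and number strings."""
    return [part.strip() for part in SPLITTER.split(s) if part.strip()]


def to_rows(s):
    """Pair each number with the name before it (None if there is none)."""
    rows, name = [], None
    for token in split_names_and_values(s):
        if re.fullmatch(NUMBER, token):
            rows.append((name, float(token)))
            name = None
        else:
            name = token
    # A trailing name without a value (such as "tOtAL") is dropped.
    return rows


print(split_names_and_values(text))
print(to_rows(text))
```

The rows returned by to_rows are (name, value) tuples, a shape that pandas.DataFrame accepts together with a columns argument. Note that with the full data the header text ("Security Name % to Net Assets*") would be merged into the first name, and a number with no preceding name (such as 1.59 above) gets None, so some cleanup would likely still be needed.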
