Efficient way of removing leading and trailing whitespaces around dash in text strings and then split


Suppose the pandas dataframe contains the following:

 import pandas as pd
 df = pd.DataFrame({'text': ['ABC - XYZ- Some Text', 'DEF- XYZ -sometext', 'GHI -XYZ - sometext', 'JKL-XYZ- sometext', 'MNO1- XYZ- some text', 'MNO2 - XYZ - some text', 'MNO3 - XYZ-some text', 'MNO4-XYZ -some text', 'MNO5- XYZ-sometext -someother text', 'MNO6 -XYZ -sometext-someother text']})

All I want to do is to remove leading and trailing white spaces around only the dashes and then split the data into new (multiple) columns of a new dataframe, so that the new dataframe looks like this:

 Col1    Col2    Col3         Col4              Col5    Col6 ....
 ABC     XYZ     Some Text    none              none    none
 DEF     XYZ     sometext     none              none    none
 GHI     XYZ     sometext     none              none    none
 JKL     XYZ     sometext     none              none    none
 .
 .
 MNO6    XYZ     sometext     someother text    none    none

Basically, the number of columns in the new dataframe depends on the highest number of dashes in any row (e.g. if it is 6 dashes then there will be 6 columns), and wherever there are no values for a column after the split, there will be none values.
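
Just to illustrate the padding I mean, here is a plain-Python toy (not what I want to use in the end, only to show the shape of the result I am after):

 rows = [s.split('-') for s in ['a-b-c', 'x-y']]
 width = max(len(r) for r in rows)                       # widest row decides the number of columns
 padded = [r + [None] * (width - len(r)) for r in rows]
 # padded -> [['a', 'b', 'c'], ['x', 'y', None]]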

Now, what I am trying to do is something like this:

df1 = df['text'].str.split(' - ', n=2, expand=True)
df1.columns = ['Col_1_{}'.format(x + 1) for x in df1.columns]

and then 

df2 = df1['Col_1_1'].str.split('- ', n=1, expand=True)
df2.columns = ['Col_1_1_{}'.format(x + 1) for x in df2.columns]

and so on, so that later I can merge all these columns and rename them, roughly as sketched below.
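
To be complete, the merging I had in mind afterwards looks roughly like this (assuming pd.concat, continuing from the df1/df2 snippets above; the column names are just the placeholders used there):

 import pandas as pd

 # drop the column that was split a second time and append its two new halves;
 # this would have to be repeated for every further split level
 merged = pd.concat([df1.drop(columns='Col_1_1'), df2], axis=1)
 print(merged)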

But this does not seem efficient; sorry, I am not a Python pro :'(

Is there an efficient way of achieving the result I want? Any suggestions would be appreciated.

CodePudding user response:

You can use

df['text'].str.split(r'\s*[—–-]\s*', expand=True)

The \s*[—–-]\s* pattern matches a hyphen, en dash or em dash together with zero or more whitespace chars on either side of it.
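
To see the pattern in isolation, here is a quick check with Python's re module on one of the sample strings (no pandas involved):

 import re

 pattern = r'\s*[—–-]\s*'   # hyphen, en dash or em dash, with optional surrounding whitespace
 print(re.split(pattern, 'MNO5- XYZ-sometext -someother text'))
 # ['MNO5', 'XYZ', 'sometext', 'someother text']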

With expand=True, the split parts are returned as a dataframe, and rows with fewer parts are padded with None.
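
For comparison, without expand=True the same split returns a Series of Python lists rather than columns (output shown roughly):

 df['text'].str.split(r'\s*[—–-]\s*').head(2)
 # 0    [ABC, XYZ, Some Text]
 # 1    [DEF, XYZ, sometext]
 # dtype: object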

See the Pandas test:

import pandas as pd
df = pd.DataFrame({'text': ['ABC - XYZ- Some Text', 'DEF- XYZ -sometext', 'GHI -XYZ - sometext', 'JKL-XYZ- sometext', 'MNO1- XYZ- some text', 'MNO2 - XYZ - some text', 'MNO3 - XYZ-some text', 'MNO4-XYZ -some text', 'MNO5- XYZ-sometext -someother text', 'MNO6 -XYZ -sometext-someother text']})
df1 = df['text'].str.split(r'\s*[—–-]\s*', expand=True)
df1.columns = ['Col{}'.format(x + 1) for x in df1.columns]

>>> df1
   Col1 Col2       Col3            Col4
0   ABC  XYZ  Some Text            None
1   DEF  XYZ   sometext            None
2   GHI  XYZ   sometext            None
3   JKL  XYZ   sometext            None
4  MNO1  XYZ  some text            None
5  MNO2  XYZ  some text            None
6  MNO3  XYZ  some text            None
7  MNO4  XYZ  some text            None
8  MNO5  XYZ   sometext  someother text
9  MNO6  XYZ   sometext  someother text
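
If you want the Col1, Col2, ... names in one chained call and literal 'none' strings instead of None (as in the desired output above), one possible variant is the following; the rename/fillna choices are just a suggestion:

 df1 = (df['text']
        .str.split(r'\s*[—–-]\s*', expand=True)
        .rename(columns=lambda x: 'Col{}'.format(x + 1))  # 0, 1, 2, ... -> Col1, Col2, Col3, ...
        .fillna('none'))                                   # replace the padding None with the string 'none'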