I have a DataFrame with one column whose values contain multiple strings separated by commas. In each string I want to remove everything from the hyphen onward (including the hyphen). In some cases there is no hyphen but there is an opening parenthesis, and I want to remove everything from that point as well, applying the same rule to every item after a comma. How can I do this? The last row shows this case.
import pandas as pd
dd = pd.DataFrame()
dd['sin'] = ['U147(BCM), U35(BCM)','P01-00(ECM), P02-00(ECM)', 'P3-00(ECM), P032-00(ECM)','P034-00(ECM)', 'P23F5(PCM), P04-00(ECM)']
Expected output
dd['sin']
# output
U147 U35
P01 P02
P3 P032
P034
P23F5 P04
I want to keep only the part of each string before the hyphen, parenthesis, or any other special character.
CodePudding user response:
The following code extracts the pieces you want, one per row:
dd['sin'] = dd['sin'].str.split(", ")
dd = dd.explode('sin').reset_index()
dd['sin'] = dd['sin'].str.replace(r'\W.*', '', regex=True)
Which gives dd['sin'] as:
0 U147
1 U35
2 P01
3 P02
4 P3
5 P032
6 P034
7 P23F5
8 P04
Name: sin, dtype: object
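The call to .reset_index() in the second line is optional. It moves the original row labels into an 'index' column, so you can still tell which row each piece of the string came from; without it, the exploded rows simply keep their original (repeated) index labels.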
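If you want one space-separated string per original row, as in the expected output of the question, the exploded pieces can be grouped back together by their original index. This is only a sketch under the assumption that the items are always separated by ", " exactly; it skips .reset_index() so that the repeated index labels can be used for grouping:

```python
import pandas as pd

dd = pd.DataFrame({'sin': ['U147(BCM), U35(BCM)', 'P01-00(ECM), P02-00(ECM)',
                           'P3-00(ECM), P032-00(ECM)', 'P034-00(ECM)',
                           'P23F5(PCM), P04-00(ECM)']})

# One piece per row; explode() repeats the original index label for each piece.
pieces = dd['sin'].str.split(', ').explode()

# Drop everything from the first non-word character onward.
pieces = pieces.str.replace(r'\W.*', '', regex=True)

# Join the pieces that share an original index label with a single space.
dd['sin'] = pieces.groupby(level=0).agg(' '.join)
print(dd)
```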
CodePudding user response:
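You can use the following regex, which assumes the part after the hyphen is always two digits and the tag in parentheses is always ECM, BCM, or PCM: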
r"-\d{2}|\([EBP]CM\)|\s"
Here is the code:
import pandas as pd
sin = ['U147(BCM), U35(BCM)','P01-00(ECM), P02-00(ECM)', 'P3-00(ECM), P032-00(ECM)','P034-00(ECM)', 'P23F5(PCM), P04-00(ECM)']
dd = pd.DataFrame()
dd['sin'] = sin
dd['sin'] = dd['sin'].str.replace(r'-\d{2}|\([EBP]CM\)|\s', '', regex=True)
print(dd)
OUTPUT:
sin
0 U147,U35
1 P01,P02
2 P3,P032
3 P034
4 P23F5,P04
EDIT
Or use this line to also replace the commas with spaces:
dd['sin'] = dd['sin'].str.replace(r'-\d{2}|\([EBP]CM\)|\s', '', regex=True).str.replace(',',' ')
OUTPUT:
sin
0 U147 U35
1 P01 P02
2 P3 P032
3 P034
4 P23F5 P04
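If the suffixes are not always two digits or one of those three tags, a more general variant is to keep only the leading run of word characters of each comma-separated item. The helper below is a hypothetical sketch (the name keep_prefixes is not from either answer) and assumes every item starts with the code you want to keep; note that \w also matches underscores:

```python
import re

def keep_prefixes(cell):
    """Keep the leading word characters of each comma-separated item."""
    # Split on commas and trim the surrounding whitespace of each item.
    tokens = [t.strip() for t in cell.split(',')]
    # r'\w*' always matches at the start, possibly with an empty result.
    return ' '.join(re.match(r'\w*', t).group() for t in tokens)

rows = ['U147(BCM), U35(BCM)', 'P01-00(ECM), P02-00(ECM)',
        'P3-00(ECM), P032-00(ECM)', 'P034-00(ECM)', 'P23F5(PCM), P04-00(ECM)']
cleaned = [keep_prefixes(r) for r in rows]
print(cleaned)
```

With the DataFrame from the question, the same helper could be applied with dd['sin'] = dd['sin'].map(keep_prefixes).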