I have a data frame that contains a description of services performed on a vehicle. I would like to remove the word VIN followed by all words to the right of it.
Description
Install Ceramic Film on the front 2 roll up on a 18 CX-5 Vin.#J1452239 St.#G3056 Per.Wally/Joseph Blenkinsop
Install Ceramic Film on the front 2 roll up on a 18 Terrain Vin.#JL225216 St.#218369 Per.Toby
Install Ceramic Film on the front 2 roll up on a 18 Terrain Vin.#JL286535 St.# Per.Tanner
Install Ceramic Film on the front 2 roll up on a 18 Yukon Vin.#JR297209 St.# Per.Randy/victoria Celaya
Install Ceramic Film on the front 2 roll up on a 19 1500 Vin.#KG174232 St.# Per.Colby
Install Ceramic Film on the front 2 roll up on a 19 1500 Vin.#KG265533 St.# Per.Colby
Install Ceramic Film on the front 2 roll up on a 19 Blazer Vin.#KS644905 St.# Per. J.C.
I have not gotten any useful output using prefix/suffix. Thank you.
CodePudding user response:
We can use str.replace
here:
df["Description"] = df["Description"].str.replace(r'\s*\bVin\b.*$', '', regex=True, flags=re.I)
CodePudding user response:
Using pandas
Since your strings are in a DataFrame, you can use pandas methods directly rather than calling python standard library functions on individual strings.
Most method from python's str
are available in pandas, directly applicable to a Series or DataFrame.
import pandas as pd
df = pd.DataFrame({'description': '''Install Ceramic Film on the front 2 roll up on a 18 CX-5 Vin.#J1452239 St.#G3056 Per.Wally/Joseph Blenkinsop
Install Ceramic Film on the front 2 roll up on a 18 Terrain Vin.#JL225216 St.#218369 Per.Toby
Install Ceramic Film on the front 2 roll up on a 18 Terrain Vin.#JL286535 St.# Per.Tanner
Install Ceramic Film on the front 2 roll up on a 18 Yukon Vin.#JR297209 St.# Per.Randy/victoria Celaya
Install Ceramic Film on the front 2 roll up on a 19 1500 Vin.#KG174232 St.# Per.Colby
Install Ceramic Film on the front 2 roll up on a 19 1500 Vin.#KG265533 St.# Per.Colby
Install Ceramic Film on the front 2 roll up on a 19 Blazer Vin.#KS644905 St.# Per. J.C.'''.split('\n')})
df['short description'] = df['description'].str.split('Vin', 1).str[0]
print(df['description'][0])
# Install Ceramic Film on the front 2 roll up on a 18 CX-5 Vin.#J1452239 St.#G3056 Per.Wally/Joseph Blenkinsop
print(df['short description'][0])
# Install Ceramic Film on the front 2 roll up on a 18 CX-5
Using python's builtin str
Without using any regexps, you can use str.split
to split the string on 'Vin'
, then keep only the left half:
s = 'Install Ceramic Film on the front 2 roll up on a 18 CX-5 Vin.#J1452239 St.#G3056 Per.Wally/Joseph Blenkinsop'
left, right = s.split('Vin', maxsplit=1)
print(left)
Output:
Install Ceramic Film on the front 2 roll up on a 18 CX-5
Or alternatively:
left = s.split('Vin', maxsplit=1)[0]
If you wanted to split on the last occurrence of 'Vin'
instead of the first occurrence, then you could use rsplit
instead of split
.
Relevant documentation: