Home > Back-end >  Removing words to the right of a string in a dataframe
Removing words to the right of a string in a dataframe

Time:09-02

I have a data frame that contains a description of services performed on a vehicle. I would like to remove the word VIN followed by all words to the right of it.

Description
Install Ceramic Film on the front 2 roll up on a 18 CX-5 Vin.#J1452239 St.#G3056 Per.Wally/Joseph Blenkinsop
Install Ceramic Film on the front 2 roll up on a 18 Terrain Vin.#JL225216 St.#218369 Per.Toby
Install Ceramic Film on the front 2 roll up on a 18 Terrain Vin.#JL286535 St.# Per.Tanner
Install Ceramic Film on the front 2 roll up on a 18 Yukon Vin.#JR297209 St.# Per.Randy/victoria Celaya
Install Ceramic Film on the front 2 roll up on a 19 1500 Vin.#KG174232 St.# Per.Colby
Install Ceramic Film on the front 2 roll up on a 19 1500 Vin.#KG265533 St.# Per.Colby
Install Ceramic Film on the front 2 roll up on a 19 Blazer Vin.#KS644905 St.# Per. J.C.

I have not gotten any useful output using prefix/suffix. Thank you.

CodePudding user response:

We can use str.replace here:

df["Description"] = df["Description"].str.replace(r'\s*\bVin\b.*$', '', regex=True, flags=re.I)

CodePudding user response:

Using pandas

Since your strings are in a DataFrame, you can use pandas methods directly rather than calling python standard library functions on individual strings.

Most method from python's str are available in pandas, directly applicable to a Series or DataFrame.

import pandas as pd

df = pd.DataFrame({'description': '''Install Ceramic Film on the front 2 roll up on a 18 CX-5 Vin.#J1452239 St.#G3056 Per.Wally/Joseph Blenkinsop
Install Ceramic Film on the front 2 roll up on a 18 Terrain Vin.#JL225216 St.#218369 Per.Toby
Install Ceramic Film on the front 2 roll up on a 18 Terrain Vin.#JL286535 St.# Per.Tanner
Install Ceramic Film on the front 2 roll up on a 18 Yukon Vin.#JR297209 St.# Per.Randy/victoria Celaya
Install Ceramic Film on the front 2 roll up on a 19 1500 Vin.#KG174232 St.# Per.Colby
Install Ceramic Film on the front 2 roll up on a 19 1500 Vin.#KG265533 St.# Per.Colby
Install Ceramic Film on the front 2 roll up on a 19 Blazer Vin.#KS644905 St.# Per. J.C.'''.split('\n')})

df['short description'] = df['description'].str.split('Vin', 1).str[0]

print(df['description'][0])
# Install Ceramic Film on the front 2 roll up on a 18 CX-5 Vin.#J1452239 St.#G3056 Per.Wally/Joseph Blenkinsop

print(df['short description'][0])
# Install Ceramic Film on the front 2 roll up on a 18 CX-5 

Using python's builtin str

Without using any regexps, you can use str.split to split the string on 'Vin', then keep only the left half:

s = 'Install Ceramic Film on the front 2 roll up on a 18 CX-5 Vin.#J1452239 St.#G3056 Per.Wally/Joseph Blenkinsop'

left, right = s.split('Vin', maxsplit=1)

print(left)

Output:

Install Ceramic Film on the front 2 roll up on a 18 CX-5 

Or alternatively:

left = s.split('Vin', maxsplit=1)[0]

If you wanted to split on the last occurrence of 'Vin' instead of the first occurrence, then you could use rsplit instead of split.

Relevant documentation:

  • Related