I want to remove the word DISCONTINUED
from a column using a loop and train the model for further use. The code that I have tried:
# 1.
for i in df['DESCRIPTION'][0]:
if i[0]== 'DISCONTINUED':
df.i[0].pop(0)
# 2.
for item in df['DESCRIPTION']:
if str(item)[0]=='DISCONTINUED':
df.remove(item[0])
Note: df
is the dataset name and DESCRIPTION
is the column name. The column has the dtype object
. I have tried to convert that into str
, but it didn't work.
The values in the column DESCRIPTION are:
data = {'DESCRIPTION':
['ANDREW, 245173, 1/2-1/2 COLD SHRINK KIT, CEQ.24038',
'COMMSCOPE, 245174, 1/2, 3/8 COLDSRINK WTHRPRFNG KIT, CEQ.24753',
'DISCONTINUED, COMMSCOPE, 252107, LACE UP I-LINE HOISTING GRIP FOR 1/2 CABLES',
'COMMSCOPE, 252110, LACE UP HOISTING GRIP FOR 1-1/4 COAX & EW63/64 WAVEGUIDE',
'ANDREW, 252111, 1-5/8 HOISTING GRIP, LACE UP']}
I want to remove DISCONTINUED
from a column having multiple values in it separated by a comma.
CodePudding user response:
Your initial approaches fail, in part because you are treating the values in df.DESCRIPTION
as if they are lists that contain strings, rather than simply strings. E.g.:
print(type(df['DESCRIPTION'][0]))
<class 'str'>
You can use Series.replace
in this case:
Data
import pandas as pd
# just adding some extra `DISCONTINUED` in `3, 4`
data = {'DESCRIPTION': ['ANDREW, 245173, 1/2-1/2 COLD SHRINK KIT, CEQ.24038',
'COMMSCOPE, 245174, 1/2, 3/8 COLDSRINK WTHRPRFNG KIT, CEQ.24753',
'DISCONTINUED, COMMSCOPE, 252107, LACE UP I-LINE HOISTING GRIP FOR 1/2 CABLES',
'COMMSCOPE, 252110, DISCONTINUED, LACE UP HOISTING GRIP FOR 1-1/4 COAX & EW63/64 WAVEGUIDE',
'ANDREW, 252111, 1-5/8 HOISTING GRIP, LACE UP, DISCONTINUED']}
df = pd.DataFrame(data)
print(df)
DESCRIPTION
0 ANDREW, 245173, 1/2-1/2 COLD SHRINK KIT, CEQ.24038
1 COMMSCOPE, 245174, 1/2, 3/8 COLDSRINK WTHRPRFNG KIT, CEQ.24753
2 DISCONTINUED, COMMSCOPE, 252107, LACE UP I-LINE HOISTING GRIP FOR 1/2 CABLES
3 COMMSCOPE, 252110, DISCONTINUED, LACE UP HOISTING GRIP FOR 1-1/4 COAX & EW63/64 WAVEGUIDE
4 ANDREW, 252111, 1-5/8 HOISTING GRIP, LACE UP, DISCONTINUED
# So, we have `DISCONTINUED` in `2` (at start), in `3` (third "elem"), and `4` (at end).
Code
df.DESCRIPTION = df.DESCRIPTION.replace(r',?\s?DISCONTINUED,?\s?','', regex=True)
print(df)
DESCRIPTION
0 ANDREW, 245173, 1/2-1/2 COLD SHRINK KIT, CEQ.24038
1 COMMSCOPE, 245174, 1/2, 3/8 COLDSRINK WTHRPRFNG KIT, CEQ.24753
2 COMMSCOPE, 252107, LACE UP I-LINE HOISTING GRIP FOR 1/2 CABLES
3 COMMSCOPE, 252110LACE UP HOISTING GRIP FOR 1-1/4 COAX & EW63/64 WAVEGUIDE
4 ANDREW, 252111, 1-5/8 HOISTING GRIP, LACE UP
Explanation regex pattern ,?\s?DISCONTINUED,?\s?
:
?
matchees the previous token between zero and one times. E.g. we use it for potential commas (,
) and spaces (\s
) before and/or afterDISCONTINUED
.
Alternative method could be:
df.DESCRIPTION = df.DESCRIPTION.apply(lambda row: ', '.join(
[c for c in row.split(', ') if c != 'DISCONTINUED']))
CodePudding user response:
If your data set is a matrix where the first column could contain the word discontinued then it should probably look something like:
for row in df['DESCRIPTION']:
if row[0] == 'DISCONTINUED'
df.remove(row[0])
The issues with the code you have tried are that #1 only iterates the first row of df and #2 is only checking the first character of the string created by str(item) so it will never equal Discontinued. Changing str(item)[0] to str(item[0]) could also work, but if it's already a string there's no reason to 'reconvert' it to a string.