I have a dataset with a dataframe['Title'] column that contains car brand and accessory information.
I want to add two new columns, dataframe['Brand'] and dataframe['model'], holding the brand name and the model of the vehicle. For the first record, for example, the brand is Mahindra and the model is XUV300. If the record is like the second entry (Blaupunkt Colombo 130 BT), I want brand --> universal and model --> NaN.
What I tried:
brand = []
for i in vehicle_make:
    for j in range(len(df2['Title'])):
        text = df2['Title'][j].lstrip().lower()
        print(text)
        if i in text:
            df2['brand'][j] = i
            print("yes")
        else:
            df2['brand'][j] = 'Unversal'
            print('No')
where vehicle_make contains brand names.
['ford',
'honda',
'hyundai',
'Kia',
'mahindra',
'maruti',
'mg',
'nissan',
'renault',
'skoda',
'tata',
'toyota',
'volkswagen']
which I scraped from the same website.
The above code runs, but it does not pick up all the values.
mahindra xuv 300 led door foot step sill plate mirror finish black glossy
No
blaupunkt colombo 130 bt
No
nissan terrano mud flap /mud guard
No
mg hector plus 4d boss leatherite car floor mat black( without grass mat)
No
ford endeavour body protection waterproof car cover (grey)
yes
starid wiper blade framless for volkswagen polo (size 24' and 16'' ) black
No
mahindra tuv300 rain door visor without chrome line (2015-2017)
No
This is the output I am getting. What is wrong here?
CodePudding user response:
This answer assumes that all text following a brand name corresponds to the model. We can form a regex alternation of the brand names and then use str.extract with this alternation.
import re
brands = ['ford', 'honda', 'hyundai', 'Kia', 'mahindra', 'maruti', 'mg', 'nissan', 'renault', 'skoda', 'tata', 'toyota', 'volkswagen']
# Group 1 captures the brand, group 2 everything after it (assumed to be the model)
regex = r'\b(' + r'|'.join(brands) + r') (.*)'
df2[["Brand", "Model"]] = df2["Title"].str.extract(regex, flags=re.I)
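To make the idea concrete, here is a minimal sketch run on a few of the titles from the question. The small sample DataFrame, the re.escape call, the trailing \b, and the fillna("universal") fallback are additions for illustration and are not part of the original data or code.

```python
import re
import pandas as pd

# Sample rows copied from the output printed in the question (illustrative only).
titles = [
    "mahindra xuv 300 led door foot step sill plate mirror finish black glossy",
    "blaupunkt colombo 130 bt",
    "nissan terrano mud flap /mud guard",
    "starid wiper blade framless for volkswagen polo (size 24' and 16'' ) black",
]
df2 = pd.DataFrame({"Title": titles})

brands = ["ford", "honda", "hyundai", "Kia", "mahindra", "maruti", "mg",
          "nissan", "renault", "skoda", "tata", "toyota", "volkswagen"]

# re.escape guards against brand names that contain regex metacharacters.
# Group 1 is the brand; group 2 is everything after it (assumed to be the model).
pattern = r"\b(" + "|".join(map(re.escape, brands)) + r")\b\s+(.*)"

# With two unnamed groups, str.extract returns a DataFrame with columns 0 and 1.
extracted = df2["Title"].str.extract(pattern, flags=re.I)

# Rows without any known brand get "universal"; their model stays NaN.
df2["Brand"] = extracted[0].str.lower().fillna("universal")
df2["Model"] = extracted[1]

print(df2[["Brand", "Model"]])
```

Unlike the nested loop in the question, which rewrites every row once per brand so that only matches for the last brand in the list survive, this makes a single pass over the titles. The flags=re.I argument also lets the capitalised 'Kia' entry match lowercased text.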
CodePudding user response:
Use: df.loc[df.brand == '', 'brand'] = df.Title.str.split().str.get(0)
This code is untested because you didn't share a minimal reproducible example.
https://stackoverflow.com/help/minimal-reproducible-example
Ref link: pandas dataframe return first word in string for column
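As a rough illustration of that one-liner, here is a small sketch. It assumes a 'brand' column already exists and holds an empty string wherever no brand was found; the two sample rows are made up for the example.

```python
import pandas as pd

# Hypothetical sample: the first row has no brand yet, the second already has one.
df = pd.DataFrame({
    "Title": ["Blaupunkt Colombo 130 BT", "Ford Endeavour body protection car cover"],
    "brand": ["", "ford"],
})

# Select only the rows whose brand is still empty.
mask = df["brand"] == ""

# Fill those rows with the first whitespace-separated word of the title.
df.loc[mask, "brand"] = df.loc[mask, "Title"].str.split().str.get(0)

print(df)
```

Note that this fills in the first word of the title ("Blaupunkt" here) rather than the "universal" label the question asks for.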
Updated answer:
Since the first word in df['Title'] is the brand name, this method builds a list by extracting the first word from each row of the df['Title'] column. Each row of df['Title'] is then scanned for any word from that list, and the matching word is stored in a new column.
A brand name that contains two words, for example Alfa Romeo or Mercedes-Benz, is also handled in the code.
Reproducible code with a similar use case:
# Import pandas and NumPy
import pandas as pd
import numpy as np
# Initialize the list of car names
data = ['Toyota Yaris HB s', 'BMW Serie 3 340 M xDrive', 'Baic', 'Bmw', 'GMC World', 'BAIC Green', 'Alfa Romeo Mito Hatch Back Mito Veloce', 'Mercedes-Benz A3212 Model']
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['car-name'])
# Take the first word of each name as the make
df['make'] = df['car-name'].str.split().str.get(0)
# Build the list of brands from the df['make'] column
lst = df['make'].tolist()
# Keep every brand from the list that occurs in the car name (NaN if none)
df['brands'] = df['car-name'].apply(lambda x: ';'.join([m for m in lst if m in x])).replace('', np.nan)
# Special case for a two-word brand: append 'Romeo' when the brand contains 'Alfa'
df['brands'] = np.select([df['brands'].str.contains('Alfa')], [df['brands'] + 'Romeo'], df['brands'])
# Print the DataFrame
df
Output:
   car-name                  make           brands
0  Toyota Yaris HB s         Toyota         Toyota
1  BMW Serie 3 340 M xDrive  BMW            BMW
2  Baic                      Baic           Baic
3  Bmw                       Bmw            Bmw
4  GMC World                 GMC            GMC
5  BAIC Green                BAIC           BAIC
6  Alfa Romeo Mito Hatch     Alfa           AlfaRomeo
7  Mercedes-Benz A3212 Mode  Mercedes-Benz  Mercedes-Benz
Ref link: How to replace values in a pandas dataframe with the ones on a list by searching for similar values?
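If a list of full brand names is available up front, the hard-coded 'Alfa' special case might be avoidable. The sketch below is an alternative technique, not the method used above. It assumes a hypothetical known_brands list and matches the longest names first with a regex alternation.

```python
import re
import pandas as pd

df = pd.DataFrame({"car-name": [
    "Toyota Yaris HB s",
    "Alfa Romeo Mito Hatch Back Mito Veloce",
    "Mercedes-Benz A3212 Model",
    "Bmw",
]})

# Hypothetical list of full brand names; in practice this would come from your own data.
known_brands = ["Alfa Romeo", "Mercedes-Benz", "Toyota", "BMW"]

# Sort longest first so a multi-word brand wins over any shorter name it contains.
ordered = sorted(known_brands, key=len, reverse=True)
alternation = "|".join(re.escape(b) for b in ordered)

# expand=False returns a Series for a single capture group; unmatched rows become NaN.
df["brands"] = df["car-name"].str.extract(
    rf"\b({alternation})\b", flags=re.I, expand=False
)

print(df)
```

The captured text keeps the casing found in the title, so "Bmw" stays "Bmw"; normalise it afterwards if a canonical spelling is needed.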