Get a particular text from a column in a dataframe

I have a dataset with a dataframe['Title'] column that contains car brand and accessory information.

I want to add two new columns, dataframe['Brand'] and dataframe['model'], holding the brand name and the model of the vehicle: for example, Mahindra as the brand and XUV300 as the model for the first record. If the record has no vehicle brand, like the second entry (Blaupunkt Colombo 130 BT), I want brand --> universal and model --> NaN.

What I tried:

brand = []
for i in vehicle_make:
    for j in range(len(df2['Title'])):
        text = df2['Title'][j].lstrip().lower()
        print(text)
        if i in text:
            df2['brand'][j] = i
            print("yes")
        else:
            df2['brand'][j] = 'Unversal'
            print('No')

where vehicle_make contains brand names.

['ford',
 'honda',
 'hyundai',
 'Kia',
 'mahindra',
 'maruti',
 'mg',
 'nissan',
 'renault',
 'skoda',
 'tata',
 'toyota',
 'volkswagen']

which I scraped from the same website.

The above code runs, but it does not pick up all the values.

mahindra xuv 300 led door foot step sill plate mirror finish black glossy
No
blaupunkt colombo 130 bt
No
nissan terrano mud flap /mud guard
No
mg hector plus 4d boss leatherite car floor mat black( without grass mat)
No
ford endeavour body protection waterproof car cover (grey)
yes
starid wiper blade framless for volkswagen polo (size 24' and 16'' ) black
No
mahindra tuv300 rain door visor without chrome line (2015-2017)
No

This is the output I am getting. What is wrong here?

CodePudding user response：

This answer assumes that all text following a brand name would correspond to the model. We can form a regex alternation of brand names, and then use str.extract with this alternation.

import re

brands = ['ford', 'honda', 'hyundai', 'Kia', 'mahindra', 'maruti', 'mg', 'nissan', 'renault', 'skoda', 'tata', 'toyota', 'volkswagen']
regex = r'\b(' + r'|'.join(brands) + r') (.*)'
# Group 1 is the brand and group 2 is the remaining text; rows with no match become NaN
df2[["Brand", "Model"]] = df2["Title"].str.extract(regex, flags=re.I)
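As for what went wrong in the question's loop: the brands are iterated in the outer loop, so every later brand overwrites an earlier match with 'Unversal' unless that later brand also appears in the title. Matching once per row avoids this. Below is a minimal, self-contained sketch of the same str.extract idea applied to a few titles copied from the question. It assumes that the model is the single word following the brand (so "xuv 300" yields only "xuv") and that unmatched rows should get the brand "universal"; both are illustrative assumptions, not requirements confirmed by the asker.

```python
import re
import pandas as pd

# Sample titles copied from the question's printed output.
df2 = pd.DataFrame({"Title": [
    "mahindra xuv 300 led door foot step sill plate mirror finish black glossy",
    "blaupunkt colombo 130 bt",
    "nissan terrano mud flap /mud guard",
    "starid wiper blade framless for volkswagen polo (size 24' and 16'' ) black",
]})

brands = ["ford", "honda", "hyundai", "Kia", "mahindra", "maruti", "mg",
          "nissan", "renault", "skoda", "tata", "toyota", "volkswagen"]

# Escape each brand and require word boundaries on both sides so that
# short names such as "mg" do not match inside longer words.
pattern = r"\b(" + "|".join(re.escape(b) for b in brands) + r")\b\s+(\S+)"

# expand=True returns one column per capture group; rows with no match are NaN.
extracted = df2["Title"].str.extract(pattern, flags=re.IGNORECASE, expand=True)

# Assumption: titles without a known brand are labelled "universal".
df2["Brand"] = extracted[0].str.lower().fillna("universal")
# Assumption: the model is the single word right after the brand.
df2["Model"] = extracted[1]

print(df2[["Brand", "Model"]])
```

If the model can span several words, the second group would need a different pattern or a lookup list of known models.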

CodePudding user response：

Use: df.loc[df.brand == '', 'brand'] = df.Title.str.split().str.get(0)

The code is not tested, as you didn't share a minimal reproducible example.

https://stackoverflow.com/help/minimal-reproducible-example

Ref link: pandas dataframe return first word in string for column

Updated answer:

Since the first word in df['Title'] is the brand name, this method builds a list by extracting the first word from each row of that column. Each row is then scanned for the words in that list, and the matching word is stored in a new column.

A brand name made of two words, such as Alfa Romeo, or a hyphenated one, such as Mercedes-Benz, is also handled in the code.

Reproducible code with a similar use case:

# Import the pandas and numpy libraries
import numpy as np
import pandas as pd

# Initialize a list of car names
data = ['Toyota Yaris HB s', 'BMW Serie 3 340 M xDrive','Baic','Bmw', 'GMC World','BAIC Green','Alfa Romeo Mito Hatch Back Mito Veloce','Mercedes-Benz A3212 Model']
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['car-name'])
# Take the first word of each car name as the make
df['make'] = df['car-name'].str.split().str.get(0)
# Build the list of brands from the df['make'] column
lst = df['make'].tolist()
# For each car name, join every listed brand it contains; empty results become NaN
df['brands'] = df['car-name'].apply(lambda x: ';'.join([m for m in lst if m in x])).replace('', np.nan)
# Special case for a two-word brand: append 'Romeo' to rows containing 'Alfa'
df['brands'] = np.select([df['brands'].str.contains('Alfa')], [df['brands'] + 'Romeo'], df['brands'])

# Print the dataframe
df

Output:

    car-name                  make           brands
0   Toyota Yaris HB s         Toyota         Toyota
1   BMW Serie 3 340 M xDrive  BMW            BMW
2   Baic                      Baic           Baic
3   Bmw                       Bmw            Bmw
4   GMC World                 GMC            GMC
5   BAIC Green                BAIC           BAIC
6   Alfa Romeo Mito Hatch     Alfa           AlfaRomeo
7   Mercedes-Benz A3212 Mode  Mercedes-Benz  Mercedes-Benz
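If a list of known makes is available, multi-word brands can be matched directly instead of being patched afterwards with a special condition. The sketch below is one possible variation, not the method used above: known_makes is a hypothetical list written for illustration, and the last row is borrowed from the question to show a title with no known make.

```python
import re
import pandas as pd

df = pd.DataFrame({"car-name": [
    "Toyota Yaris HB s",
    "BMW Serie 3 340 M xDrive",
    "Alfa Romeo Mito Hatch Back Mito Veloce",
    "Mercedes-Benz A3212 Model",
    "Blaupunkt Colombo 130 BT",
]})

# Hypothetical list of known makes; multi-word names are listed in full.
known_makes = ["Toyota", "BMW", "Alfa Romeo", "Mercedes-Benz"]

# Put longer names first so that a longer make wins over any shorter
# make it happens to start with.
ordered = sorted(known_makes, key=len, reverse=True)

# Lookarounds keep a make from matching inside a longer word.
pattern = r"(?<!\w)(" + "|".join(re.escape(m) for m in ordered) + r")(?!\w)"

# With a single capture group and expand=False, str.extract returns a Series;
# rows without a known make are NaN.
df["brands"] = df["car-name"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

print(df)
```

This keeps the space in "Alfa Romeo" and leaves unknown makes as NaN, which can then be filled with a default if needed.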

Ref link: How to replace values in a pandas dataframe with the ones on a list by searching for similar values?