Is there a way for me to optimize these loops through vectorization in order for this to be much faster?

Basically, the function checks whether each string in a dataframe column of names contains a keyword from one of several lists and, if so, returns the matching label. The first loop computes the text distance against every entry in top_1k, and that is what makes the process run for hours. Is there any way to convert this to a vectorized form so that it runs faster?

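import re

import textdistance

def check_segment(name):
    # Raw strings keep the regex escapes (\W, \*) from being read as string escapes.
    name = re.sub(r'\W ', ' ', name)
    name = re.sub(r'\*', '', name)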
    
    for i in top_1k:
        x = textdistance.jaro_winkler(i, name)
        if x > 0.8:
            return 'Top 1K'
            
    for i in school:
        if i in name:
            return 'School'
    
    for i in hospital:
        if i in name:
            return 'Hospital'
        
    for i in govt:
        if i in name:
            return 'Government'
    
    for i in coop:
        if i in name:
            return 'Cooperative'
    
    for i in banks:
        if i in name:
            return 'Bank'
    
    for i in sarisari:
        if i in name:
            return 'Sari-Sari Store'
        
    for i in malls:
        if i in name:
            return 'Malls'
    
    for i in remittance_center:
        if i in name:
            return 'Remittance Center'
    
    for i in hotel:
        if i in name:
            return 'Hotels'
    
    for i in foundation:
        if i in name:
            return 'Foundation'
    
    for i in embassy:
        if i in name:
            return 'Embassy'
    
    return 'SME'

The lists are defined so that if a keyword from one of them appears in the string, the row gets that list's label:

school = ["UNIVERSITY","ACADEMY","COLLEGE","ACADEME","SCHOOL","MONTESSORI","ELEMENTARY","HIGH SCHOOL","COLLEGIO","INSTITUTE"]
hospital = ["HOSPITAL","LABORATORY","CLINIC","MEDICAL","DIAGNOSTIC","HEALTH","DOCTOR", "HEALTHCARE"]
govt = ["DEPARTMENT OF","CITY GOVERNMENT","OFFICE OF THE","PROVINCE OF","PROVINCIAL","CITY TREASURER","REGISTRY OF","REGISTER OF",
          "BUREAU OF","MUNICIPAL","COMMISSION","PEZA","HDMF","WATER DISTRICT","HOME DEVELOPMENT MUTUAL FUND","CLERK OF COURT", 
          "CITY OF","BARANGAY", "GOVERNMENT"]
coop = ["COOP", "COOPERATIVE"]
hotel = ["HOTEL","RESORT", "CONDOTEL", "HOTELIERS", "INN"]
foundation = ["FOUNDATION"]
embassy = ["EMBASSY"]

df['Segment'] = df['Name'].apply(check_segment)

The input dataframe is:

Name
WORLD FOUNDATION
SUNNY RESORT
COOPERATIVE SOCIETY
CITY GOVERNMENT OF PLAZA
COLLEGE OF MUSIC

After applying the function, the output dataframe is:

Name	Segment
WORLD FOUNDATION	Foundation
SUNNY RESORT	Hotels
COOPERATIVE SOCIETY	Cooperative
CITY GOVERNMENT OF PLAZA	Government
COLLEGE OF MUSIC	School

CodePudding user response：

You can vectorize essentially all of your code. With pandas you should generally avoid apply, since there is nearly always a way to use something else. When vectorizing (applying an operation to a whole column at once instead of row by row), you should expect large performance gains compared to looping over rows or using apply.

There are two separate issues in your code:

1. Jaro-Winkler text distance: you can absolutely use Jaro-Winkler in a vectorized way; see the question "Applying Jaro-Winkler distance to dataframe".
2. Creating the segment column: you should not iterate over the lists. I would use code like this:

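# isin() only matches whole names, so use str.contains() for the substring test
# that the original loops perform. The keywords are joined into one regex pattern.
df["Segment"] = "SME"  # default label, as in the original function
# Lower-priority labels are assigned first so that later ones overwrite them.
df.loc[df["Name"].str.contains("|".join(foundation)), "Segment"] = "Foundation"
df.loc[df["Name"].str.contains("|".join(hotel)), "Segment"] = "Hotels"
df.loc[df["Name"].str.contains("|".join(coop)), "Segment"] = "Cooperative"
df.loc[df["Name"].str.contains("|".join(govt)), "Segment"] = "Government"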
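
If you want the whole priority order handled in one step, numpy.select is one option: it takes a list of boolean conditions and a list of labels and picks the first condition that is true for each row, which mirrors the early returns in the original function. The following is only a sketch: the keyword lists are shortened copies of the ones in the question, the row "JUAN TRADING" and the helper contains_any are made up for illustration, and the fuzzy top_1k check is left out.

```python
import re

import numpy as np
import pandas as pd

# Shortened copies of the keyword lists from the question.
school = ["UNIVERSITY", "COLLEGE", "SCHOOL"]
govt = ["CITY GOVERNMENT", "DEPARTMENT OF", "MUNICIPAL"]
coop = ["COOP", "COOPERATIVE"]
hotel = ["HOTEL", "RESORT", "INN"]
foundation = ["FOUNDATION"]

df = pd.DataFrame({"Name": [
    "WORLD FOUNDATION",
    "SUNNY RESORT",
    "COOPERATIVE SOCIETY",
    "CITY GOVERNMENT OF PLAZA",
    "COLLEGE OF MUSIC",
    "JUAN TRADING",  # hypothetical row with no keyword match
]})

# (label, keywords) pairs in the same priority order as the original loops.
rules = [
    ("School", school),
    ("Government", govt),
    ("Cooperative", coop),
    ("Hotels", hotel),
    ("Foundation", foundation),
]


def contains_any(names, keywords):
    """Boolean Series: True where a name contains at least one keyword."""
    # re.escape keeps keywords literal even if they contain regex characters.
    pattern = "|".join(re.escape(k) for k in keywords)
    return names.str.contains(pattern, regex=True, na=False)


conditions = [contains_any(df["Name"], kws) for _, kws in rules]
labels = [label for label, _ in rules]

# np.select uses the first matching condition per row; unmatched rows get "SME".
df["Segment"] = np.select(conditions, labels, default="SME")
print(df)
```

Because the first true condition wins, the order of rules is what encodes the priority, so there is no need to think about which assignment overwrites which.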

Concerning your re.sub calls, you could use str.replace, which also works in a vectorized way (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html ).
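
As a rough illustration of the str.replace approach, the sketch below cleans a whole column at once. The sample names are invented, and the pattern \W+ is an assumption about what the first re.sub in the question is meant to do (collapse runs of non-word characters into a single space). Asterisks are removed first here, because \W would otherwise turn them into spaces.

```python
import pandas as pd

# Invented sample names containing an asterisk, punctuation and extra spaces.
df = pd.DataFrame({"Name": ["SUNNY* RESORT", "A&B  TRADING", "WORLD-FOUNDATION"]})

cleaned = (
    df["Name"]
    .str.replace(r"\*", "", regex=True)    # drop asterisks
    .str.replace(r"\W+", " ", regex=True)  # assumed intent: non-word runs -> one space
    .str.strip()                           # trim leading/trailing spaces
)
print(cleaned.tolist())
```

Passing regex=True explicitly matters, since recent pandas versions treat the pattern as a literal string by default.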

You could also wrap all of this in a function and apply it to your dataframe with df.pipe(your_function).
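
A minimal sketch of the df.pipe idea, assuming a hypothetical helper add_segment (not part of pandas) that labels rows whose name contains any of the given keywords. The sample data is invented.

```python
import re

import pandas as pd


def add_segment(frame, keywords, label, column="Name"):
    """Return a copy of frame with `label` set where `column` contains a keyword."""
    out = frame.copy()
    if "Segment" not in out.columns:
        out["Segment"] = "SME"  # default label, as in the original function
    pattern = "|".join(re.escape(k) for k in keywords)
    mask = out[column].str.contains(pattern, regex=True, na=False)
    out.loc[mask, "Segment"] = label
    return out


df = pd.DataFrame({"Name": ["WORLD FOUNDATION", "SUNNY RESORT", "JUAN TRADING"]})

# Each pipe call passes the dataframe as the first argument of add_segment.
result = (
    df.pipe(add_segment, ["FOUNDATION"], "Foundation")
      .pipe(add_segment, ["HOTEL", "RESORT"], "Hotels")
)
print(result)
```

Since the helper works on a copy, df itself stays unchanged and the steps can be chained or reordered freely. Later calls overwrite earlier labels, so the highest-priority segment should come last in the chain.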