Basically, the function checks whether any string i from one of several keyword lists is found in a name taken from the dataframe's name column, and returns the matching label. The first loop computes a text distance against every entry in top_1k, which makes the process run for hours. Is there any way to convert this, for example through vectorization, so that it runs faster?
import re
import textdistance

def check_segment(name):
    # Normalize the name: drop a non-word character followed by a space, then asterisks
    name = re.sub(r'\W ', ' ', name)
    name = re.sub(r'\*', '', name)
    # Fuzzy match against the top 1K names (the slow part)
    for i in top_1k:
        x = textdistance.jaro_winkler(i, name)
        if x > 0.8:
            return 'Top 1K'
    # Substring matches; the first list that matches wins
    for i in school:
        if i in name:
            return 'School'
    for i in hospital:
        if i in name:
            return 'Hospital'
    for i in govt:
        if i in name:
            return 'Government'
    for i in coop:
        if i in name:
            return 'Cooperative'
    for i in banks:
        if i in name:
            return 'Bank'
    for i in sarisari:
        if i in name:
            return 'Sari-Sari Store'
    for i in malls:
        if i in name:
            return 'Malls'
    for i in remittance_center:
        if i in name:
            return 'Remittance Center'
    for i in hotel:
        if i in name:
            return 'Hotels'
    for i in foundation:
        if i in name:
            return 'Foundation'
    for i in embassy:
        if i in name:
            return 'Embassy'
    return 'SME'
The lists are defined so that if a word in the name matches a word in one of the lists, the name is given that list's label:
school = ["UNIVERSITY","ACADEMY","COLLEGE","ACADEME","SCHOOL","MONTESSORI","ELEMENTARY","HIGH SCHOOL","COLLEGIO","INSTITUTE"]
hospital = ["HOSPITAL","LABORATORY","CLINIC","MEDICAL","DIAGNOSTIC","HEALTH","DOCTOR", "HEALTHCARE"]
govt = ["DEPARTMENT OF","CITY GOVERNMENT","OFFICE OF THE","PROVINCE OF","PROVINCIAL","CITY TREASURER","REGISTRY OF","REGISTER OF",
"BUREAU OF","MUNICIPAL","COMMISSION","PEZA","HDMF","WATER DISTRICT","HOME DEVELOPMENT MUTUAL FUND","CLERK OF COURT",
"CITY OF","BARANGAY", "GOVERNMENT"]
coop = ["COOP", "COOPERATIVE"]
hotel = ["HOTEL","RESORT", "CONDOTEL", "HOTELIERS", "INN"]
foundation = ["FOUNDATION"]
embassy = ["EMBASSY"]
df['segment'] = df['name'].apply(check_segment)
The input dataframe is:
Name |
---|
WORLD FOUNDATION |
SUNNY RESORT |
COOPERATIVE SOCIETY |
CITY GOVERNMENT OF PLAZA |
COLLEGE OF MUSIC |
After applying the function, the output dataframe is:
Name | Segment |
---|---|
WORLD FOUNDATION | Foundation |
SUNNY RESORT | Hotels |
COOPERATIVE SOCIETY | Cooperative |
CITY GOVERNMENT OF PLAZA | Government |
COLLEGE OF MUSIC | School |
Answer:
You can vectorize 100% of your code (you should avoid apply when you are using pandas; there is nearly always a way to use something else). When vectorizing (applying an operation to a whole vector at once), you should expect huge gains in performance compared to looping over rows or using apply.
There are a few separate issues in your code:
- Jaro-Winkler text distance: you can absolutely use Jaro-Winkler in a vectorized way; see the question "Applying Jaro-Winkler distance to dataframe".
- Concerning the creation of the segment column, you should not iterate over the lists. I would use code like this:
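# isin() only matches whole values, so use str.contains() for substring matches.
# Later assignments overwrite earlier ones, so go from lowest to highest priority.
df.loc[df.Name.str.contains("|".join(foundation)), "Segment"] = "Foundation"
df.loc[df.Name.str.contains("|".join(hotel)), "Segment"] = "Hotels"
df.loc[df.Name.str.contains("|".join(coop)), "Segment"] = "Cooperative"
df.loc[df.Name.str.contains("|".join(govt)), "Segment"] = "Government"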
- Concerning your re.sub calls, you could use str.replace (also in a vectorized way): https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html
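As a rough illustration of the last two points, here is a minimal sketch that cleans the names with str.replace and then labels them with str.contains plus numpy.select, which keeps the "first matching list wins" behaviour of the original function. The keyword lists are shortened copies of the ones in the question, the `rules` structure and the extra "JUAN'S BAKERY*" row are made up for the example, and the column is assumed to be called `Name`.

```python
import re

import numpy as np
import pandas as pd

# Shortened keyword lists from the question; the order of `rules` sets priority.
school = ["UNIVERSITY", "ACADEMY", "COLLEGE", "SCHOOL"]
govt = ["DEPARTMENT OF", "CITY GOVERNMENT", "GOVERNMENT"]
coop = ["COOP", "COOPERATIVE"]
hotel = ["HOTEL", "RESORT", "INN"]
foundation = ["FOUNDATION"]

rules = [  # hypothetical structure: (label, keywords), highest priority first
    ("School", school),
    ("Government", govt),
    ("Cooperative", coop),
    ("Hotels", hotel),
    ("Foundation", foundation),
]

df = pd.DataFrame({"Name": [
    "WORLD FOUNDATION",
    "SUNNY RESORT",
    "COOPERATIVE SOCIETY",
    "CITY GOVERNMENT OF PLAZA",
    "COLLEGE OF MUSIC",
    "JUAN'S BAKERY*",  # made-up row that matches no list
]})

# Vectorized equivalent of the two re.sub calls in the question.
clean = (
    df["Name"]
    .str.replace(r"\W ", " ", regex=True)
    .str.replace("*", "", regex=False)
)

# One boolean mask per label: True where any keyword occurs as a substring.
conditions = [
    clean.str.contains("|".join(map(re.escape, words)), regex=True, na=False)
    for _, words in rules
]
labels = [label for label, _ in rules]

# np.select picks the first condition that is True for each row,
# which mirrors the early returns in check_segment.
df["Segment"] = np.select(conditions, labels, default="SME")
print(df)
```

Because every mask is computed on the whole column at once, the Python-level loop runs once per label rather than once per row.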
You could also wrap all of this in a function and apply it to your dataframe with df.pipe(your_function).
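A function used with df.pipe might look like the sketch below, shown here for the fuzzy "Top 1K" step. It is not fully vectorized: it only scores each distinct name once and then broadcasts the result with isin, which helps when names repeat. The helper name `flag_top_1k`, the sample names, and the `stand_in_similarity` scorer (difflib from the standard library, used so the example runs without extra packages) are all assumptions; in real use you would pass textdistance.jaro_winkler or a faster Jaro-Winkler implementation as `scorer`.

```python
from difflib import SequenceMatcher

import pandas as pd


def stand_in_similarity(a: str, b: str) -> float:
    # Stand-in scorer, NOT Jaro-Winkler; swap in textdistance.jaro_winkler.
    return SequenceMatcher(None, a, b).ratio()


def flag_top_1k(df, top_1k, scorer=stand_in_similarity, threshold=0.8):
    """Hypothetical helper: label rows whose Name closely matches a top_1k entry."""
    # Score each distinct name once instead of once per row.
    unique_names = df["Name"].dropna().unique()
    matches = [
        name
        for name in unique_names
        if any(scorer(ref, name) > threshold for ref in top_1k)
    ]
    out = df.copy()  # leave the caller's dataframe untouched
    if "Segment" not in out:
        out["Segment"] = "SME"  # same default as check_segment
    # isin is appropriate here because we compare whole names, not substrings.
    out.loc[out["Name"].isin(matches), "Segment"] = "Top 1K"
    return out


# Made-up sample data for illustration only.
df = pd.DataFrame({"Name": ["ACME HOLDINGS INC", "ACME HOLDINGS INC", "SUNNY RESORT"]})
top_1k = ["ACME HOLDINGS INC."]

result = df.pipe(flag_top_1k, top_1k)
print(result)
```

df.pipe passes the dataframe as the first argument, so several such functions can be chained one after another.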