I have a list of strings that mix non-English and English words. I want to keep only the English words.
Example:
phrases = [
"S/O अशोक कुमार, ब्लॉक न.-4डी, S/O Ashok Kumar, Block no.-4D.",
"स्ट्रीट-15, विभाग 5. सिविक सेंटर Street-15, sector -5, Civic Centre",
"भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,",
]
My code so far:
import re
regex = re.compile("[^a-zA-Z0-9!@#$&()\\-`. ,/\"] ")
for i in phrases:
    print(regex.sub(' ', i))
My output:
["S/O , .-4 , S/O Ashok Kumar, Block no.-4D.",
"-15, 5. Street-15, sector -5, Civic Centre",
", , , , Bhilai, Durg. Bhilai, Chhattisgarh",]
My desired output:
["S/O Ashok Kumar, Block no.-4D.",
"Street-15, sector -5, Civic Centre",
"Bhilai, Durg. Bhilai, Chhattisgarh,"]
CodePudding user response:
Looking at your data, it seems you could use the following:
import regex as re  # third-party "regex" package (pip install regex), which supports \p{...}
lst=["S/O अशोक कुमार, ब्लॉक न.-4डी, S/O Ashok Kumar, Block no.-4D.",
"स्ट्रीट-15, विभाग 5. सिविक सेंटर Street-15, sector -5, Civic Centre",
"भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,",]
for i in lst:
    print(re.sub(r'^.*\p{Devanagari}.+?\b', '', i))
Prints:
S/O Ashok Kumar, Block no.-4D.
Street-15, sector -5, Civic Centre
Bhilai, Durg. Bhilai, Chhattisgarh,
See an online regex demo
^ - Start-of-string anchor;
.*\p{Devanagari} - 0+ (greedy) characters up to the last Devanagari letter;
.+?\b - 1+ (lazy) characters up to the first word boundary.
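If installing the third-party regex module is not an option, roughly the same idea can be sketched with the built-in re module by writing the Devanagari block out as a character range. This is only a sketch: it assumes the non-English text lives in the main Devanagari block (U+0900 to U+097F), which is narrower than the full Devanagari script property, and strip_devanagari_prefix is a hypothetical helper name, not something from the answer above.

```python
import re

# Assumption: the non-English text uses the main Devanagari block, U+0900-U+097F.
# Greedy .* runs to the LAST character in that range; the lazy .+? then skips
# ahead to the first word boundary after it.
DEVANAGARI_PREFIX = re.compile(r'^.*[\u0900-\u097F].+?\b')

def strip_devanagari_prefix(text):
    """Remove everything up to the first word boundary after the last Devanagari character."""
    return DEVANAGARI_PREFIX.sub('', text)

phrases = [
    "S/O अशोक कुमार, ब्लॉक न.-4डी, S/O Ashok Kumar, Block no.-4D.",
    "स्ट्रीट-15, विभाग 5. सिविक सेंटर Street-15, sector -5, Civic Centre",
    "भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,",
]

for phrase in phrases:
    print(strip_devanagari_prefix(phrase))
```

A string with no Devanagari characters does not match the pattern and is returned unchanged.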
CodePudding user response:
If you mean that only standard English letters should remain, that your regex already handles that, and that you only want to remove the leftover ", , , ," values, you could do something like this:
def format_output(current_output):
    results = []
    for row in current_output:
        # split on the ","
        sub_elements = row.split(",")
        # the blank pieces are left as empty or whitespace-only strings, which can be filtered out
        filtered = list(filter(lambda x: len(x.strip()) > 0, sub_elements))
        # then join the elements together and append to the final results list
        results.append(",".join(filtered))
    return results
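As a self-contained illustration of this clean-up step, the sketch below runs the split/filter/join idea on two rows taken from the question's current output. The name drop_blank_fields is hypothetical, and trimming each piece and re-joining with ", " is an assumption made here to get tidy spacing.

```python
def drop_blank_fields(row):
    """Split on commas, discard pieces that are empty or only whitespace, and re-join."""
    pieces = [piece.strip() for piece in row.split(",")]
    return ", ".join(piece for piece in pieces if piece)

# Rows copied from the question's current output (after the regex substitution).
rows = [
    ", , , , Bhilai, Durg. Bhilai, Chhattisgarh",
    "S/O , .-4 , S/O Ashok Kumar, Block no.-4D.",
]

for row in rows:
    print(drop_blank_fields(row))
# Bhilai, Durg. Bhilai, Chhattisgarh
# S/O, .-4, S/O Ashok Kumar, Block no.-4D.
```

The second row shows the limit of this approach: leftovers such as "S/O" and ".-4" are not blank, so they survive the filter.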
CodePudding user response:
It seems to me that the first part of each element of the list is the Hindi translation of the second part, and that both parts contain the same number of words.
So for the example you've provided, and for any input that follows exactly the same pattern (it will break otherwise), all you have to do is take the second half of the words in each element.
phrases = ["S/O अशोक कुमार, ब्लॉक न.-4डी, S/O Ashok Kumar, Block no.-4D.",
"स्ट्रीट-15, विभाग 5. सिविक सेंटर Street-15, sector -5, Civic Centre",
"भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,",]
mod_list = []
for s in phrases:
    words = s.split()
    n = len(words)
    # keep only the second half of the words (the English part)
    mod_list.append(' '.join(words[n // 2:]))
print(mod_list)
Output:
['S/O Ashok Kumar, Block no.-4D.',
'Street-15, sector -5, Civic Centre',
'Bhilai, Durg. Bhilai, Chhattisgarh,']
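Because this approach silently returns wrong results when the two halves differ in length, it may help to add a sanity check. The sketch below is one possible guard, not part of the answer's method: second_half is a hypothetical helper that raises an error if the words it keeps still contain non-ASCII characters. It assumes Python 3.7+ for str.isascii().

```python
def second_half(text):
    """Return the second half of the words, checking that they are ASCII-only."""
    words = text.split()
    kept = words[len(words) // 2:]
    # Guard (an assumption of this sketch): the English half should be pure ASCII.
    if not all(word.isascii() for word in kept):
        raise ValueError(f"halves do not line up in: {text!r}")
    return ' '.join(kept)

print(second_half("भिलाई, दुर्ग, भिलाई, छत्तीसगढ़, Bhilai, Durg. Bhilai, Chhattisgarh,"))

try:
    # Three Hindi words but only one English word: the halves are uneven.
    second_half("भिलाई दुर्ग छत्तीसगढ़ Bhilai")
except ValueError as err:
    print("rejected:", err)
```

The check only catches Hindi words leaking into the kept half; it cannot detect English words that were dropped with the first half.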