I am looking for a regular expression that removes the digits from a word when the word is longer than 8 characters.
For example:
"Python300" -> "Python"
"Python37" -> "Python37"
I tried this expression: ^(?=.*[a-zA-Z0-9]{8,})(?=.*[0-9]).*$
but it selects everything.
Thank you!!
CodePudding user response:
Can't it be a simple if?
import re
max_length = 9
s = 'Python300'
# Keep short words as they are; otherwise strip every run of digits
s = s if len(s) < max_length else re.sub(r'[0-9]+', '', s)
CodePudding user response:
You can use this regex to remove all trailing digits from words longer than 8 characters:
\b(?=\w{9,})(\w+?)\d+\b
and replace using:
r'\1'
RegEx Explanation:
\b: Word boundary
(?=\w{9,}): Make sure the word has 9 or more characters
(\w+?): Match 1+ word chars in capture group #1 (lazy match)
\d+: Match 1+ trailing digits
\b: Word boundary
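As a rough illustration of the explanation above, the same pattern can be written with `re.VERBOSE` so that each part carries its own comment. The names `TRAILING_DIGITS` and `strip_trailing_digits` are made up for this sketch and are not part of the answer.

```python
import re

# Same pattern as above, spelled out with inline comments (re.VERBOSE
# ignores whitespace and "#" comments inside the pattern).
TRAILING_DIGITS = re.compile(
    r"""
    \b            # word boundary at the start of the word
    (?=\w{9,})    # lookahead: the word has 9 or more word characters
    (\w+?)        # group 1: as few word characters as possible (lazy)
    \d+           # one or more trailing digits
    \b            # word boundary at the end of the word
    """,
    re.VERBOSE,
)


def strip_trailing_digits(text):
    """Hypothetical helper: replace each match with capture group 1."""
    return TRAILING_DIGITS.sub(r"\1", text)


print(strip_trailing_digits("Python300"))
print(strip_trailing_digits("Python37"))
```

Because the digits must sit directly before the closing word boundary, a word such as `1234Spark` is left unchanged by this pattern.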
Code:
import re
arr = ['Python300', 'Python37']
for s in arr:
    # Replace the whole match with capture group #1 (the word without its trailing digits)
    print(re.sub(r'\b(?=\w{9,})(\w+?)\d+\b', r'\1', s))
Output:
Python
Python37
CodePudding user response:
I tried the regex, but it didn't work.
I have put the code in PySpark so that it can be reproduced.
Thanks anyway.
import pandas as pd
import pyspark.sql.functions as sf

# Assumes an active SparkSession named `spark`
a = ['python37', 'python300', '19Covid', '1234Spark', 'spark-2-python']
b = ['python37', 'python', '19Covid', 'Spark', 'spark--python']
impacto = pd.DataFrame(zip(a, b), columns=['input', 'expected'])
spark.createDataFrame(impacto) \
    .withColumn("result", sf.regexp_replace(sf.col("input"), r"\b(?=\w{9,})(\w+?)\d+\b", r'\1')) \
    .show()
+--------------+-------------+--------------+
|         input|     expected|        result|
+--------------+-------------+--------------+
|      python37|     python37|      python37|
|     python300|       python|             1|
|       19Covid|      19Covid|       19Covid|
|     1234Spark|        Spark|     1234Spark|
|spark-2-python|spark--python|spark-2-python|
+--------------+-------------+--------------+
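Two things seem to be going on in this result. The `1` for `python300` is most likely because Spark's `regexp_replace` follows Java regex rules, where a group reference in the replacement is written `$1` and `\1` comes out as a literal `1`. Also, the pattern only targets trailing digits, so `1234Spark` and `spark-2-python` would stay unchanged either way. Below is a minimal plain-Python sketch of the rule the expected column appears to follow, assuming a "word" means a whitespace-delimited token and that every digit in a token of 9 or more characters should go. The function name and the `min_length` parameter are hypothetical, and wrapping it for Spark (for example in a UDF) is not shown.

```python
import re


def drop_digits_in_long_tokens(text, min_length=9):
    """Remove all digits from tokens that have at least `min_length` characters."""

    def fix(match):
        token = match.group()
        # Short tokens are returned untouched; long ones lose every digit run.
        if len(token) >= min_length:
            return re.sub(r"\d+", "", token)
        return token

    # \S+ matches each whitespace-delimited token, hyphens included.
    return re.sub(r"\S+", fix, text)


inputs = ['python37', 'python300', '19Covid', '1234Spark', 'spark-2-python']
expected = ['python37', 'python', '19Covid', 'Spark', 'spark--python']

for value, want in zip(inputs, expected):
    print(value, '->', drop_digits_in_long_tokens(value), '(expected:', want + ')')
```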