Split long string with spaces but without punctuation

I have a long string that I need to split on spaces, so I did this on iOS:

let str = """
يَا أَيُّهَا الَّذِينَ آمَنُوا لَا تَقْرَبُوا الصَّلَاةَ وَأَنْتُمْ سُكَارَىٰ حَتَّىٰ تَعْلَمُوا مَا تَقُولُونَ وَلَا جُنُبًا إِلَّا عَابِرِي سَبِيلٍ حَتَّىٰ تَغْتَسِلُوا ۚ وَإِنْ كُنْتُمْ مَرْضَىٰ أَوْ عَلَىٰ سَفَرٍ أَوْ جَاءَ أَحَدٌ مِنْكُمْ مِنَ الْغَائِطِ أَوْ لَامَسْتُمُ النِّسَاءَ فَلَمْ تَجِدُوا مَاءً فَتَيَمَّمُوا صَعِيدًا طَيِّبًا فَامْسَحُوا بِوُجُوهِكُمْ وَأَيْدِيكُمْ ۗ إِنَّ اللَّهَ كَانَ عَفُوًّا غَفُورًا
"""
let count = str.components(separatedBy: " ").count
        
print(count) // 49

This gives 49, but the same thing in Kotlin gives 51:

val str = getString(R.string.valueHere)

val count = str.split(" ").count()

Log.d("count is " , count.toString()) // 51

With this string resource:

<string name="valueHere">يَا أَيُّهَا الَّذِينَ آمَنُوا لَا تَقْرَبُوا الصَّلَاةَ وَأَنْتُمْ سُكَارَىٰ حَتَّىٰ تَعْلَمُوا مَا تَقُولُونَ وَلَا جُنُبًا إِلَّا عَابِرِي سَبِيلٍ حَتَّىٰ تَغْتَسِلُوا ۚ وَإِنْ كُنْتُمْ مَرْضَىٰ أَوْ عَلَىٰ سَفَرٍ أَوْ جَاءَ أَحَدٌ مِنْكُمْ مِنَ الْغَائِطِ أَوْ لَامَسْتُمُ النِّسَاءَ فَلَمْ تَجِدُوا مَاءً فَتَيَمَّمُوا صَعِيدًا طَيِّبًا فَامْسَحُوا بِوُجُوهِكُمْ وَأَيْدِيكُمْ ۗ إِنَّ اللَّهَ كَانَ عَفُوًّا غَفُورًا</string>

I need the word count to be 49 on Android as well. It seems that Android counts the decorative characters between spaces as words. How can I fix this and produce the same result in Kotlin?

Edit:

fun getColorRange(): Range<Int> { 
    
    val text =  // my long string here
    val all = text.split (" ")
    val sub = (wordFrom..wordTo).map { all[it] }.joinToString(" ")
    val lower = text.indexOf(sub)
    val upper = lower + sub.length
    return Range<Int>(lower, upper)
}

If the array length is different in Kotlin, sub will be a different substring.

CodePudding user response：

Logging the split string shows where the problem is:

يَا
أَيُّهَا
الَّذِينَ
آمَنُوا
لَا
تَقْرَبُوا
الصَّلَاةَ
وَأَنْتُمْ
سُكَارَىٰ
حَتَّىٰ
تَعْلَمُوا
مَا
تَقُولُونَ
وَلَا
جُنُبًا
إِلَّا
عَابِرِي
سَبِيلٍ
حَتَّىٰ
تَغْتَسِلُوا
ۚ     >>>>>>>>>>>>>>>>>>>>> Problem here
وَإِنْ
كُنْتُمْ
مَرْضَىٰ
أَوْ
عَلَىٰ
سَفَرٍ
أَوْ
جَاءَ
أَحَدٌ
مِنْكُمْ
مِنَ
الْغَائِطِ
أَوْ
لَامَسْتُمُ
النِّسَاءَ
فَلَمْ
تَجِدُوا
مَاءً
فَتَيَمَّمُوا
صَعِيدًا
طَيِّبًا
فَامْسَحُوا
بِوُجُوهِكُمْ
وَأَيْدِيكُمْ
ۗ    >>>>>>>>>>>>>>>>>>>>> Problem here
إِنَّ
اللَّهَ
كَانَ
عَفُوًّا
غَفُورًا

So the problem is the small high marks (Quranic annotation signs rather than ordinary diacritics) such as ۚ and ۗ. Each one sits between two spaces, so Kotlin's split returns it as an item of its own.

I believe that the Kotlin version is more accurate than the Swift one, because what you need is:

Separate this String on SPACE as a delimiter (FULL STOP)

Swift, on the other hand, does not return these marks as separate items. Most likely this is because Swift compares strings by grapheme cluster: a combining mark attaches to the space before it, so that space no longer matches the " " delimiter and the mark stays attached to the preceding word. There is probably another Swift function that splits on the raw space, but I'm not sure, and that is not part of your question.

Since your string contains two of those markers, the Kotlin count is higher than the Swift one by two (51 instead of 49).
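To see what each item of the split really contains, it can help to print its code units. The following is only a minimal sketch: describeTokens is a hypothetical helper name and the sample string is made up, with a standalone U+06DA between two short words.

```kotlin
// Hypothetical helper: split on a single space and describe every token
// by the UTF-16 code units it contains.
fun describeTokens(text: String): List<String> =
    text.split(" ").map { token ->
        token.map { "U+%04X".format(it.code) }.joinToString(" ")
    }

fun main() {
    // Made-up sample: "ab", a standalone ARABIC SMALL HIGH JEEM, then "cd".
    val sample = "ab \u06DA cd"
    describeTokens(sample).forEach(::println)
}
```

The middle token consists of the mark alone, which is how the two extra items end up in the Kotlin count.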

So, the question would be: How to remove the upper diacritics/markers from a string before splitting it?

Thanks to this answer, which lists these types of markers, you can use the String replace() method in Kotlin to replace them with nothing.

Here is a snippet to fix your example:

var str = getString(R.string.valueHere)
str = str
    .replace("\u0615", "") //ARABIC SMALL HIGH TAH
    .replace("\u0616", "") //ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
    .replace("\u0617", "") //ARABIC SMALL HIGH ZAIN
    .replace("\u0618", "") //ARABIC SMALL FATHA
    .replace("\u0619", "") //ARABIC SMALL DAMMA
    .replace("\u061A", "") //ARABIC SMALL KASRA
    .replace("\u06D6", "") //ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
    .replace("\u06D7", "") //ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
    .replace("\u06D8", "") //ARABIC SMALL HIGH MEEM INITIAL FORM
    .replace("\u06D9", "") //ARABIC SMALL HIGH LAM ALEF
    .replace("\u06DA", "") //ARABIC SMALL HIGH JEEM
    .replace("\u06DB", "") //ARABIC SMALL HIGH THREE DOTS
    .replace("\u06DC", "") //ARABIC SMALL HIGH SEEN
    .replace("\u06DD", "") //ARABIC END OF AYAH
    .replace("\u06DE", "") //ARABIC START OF RUB EL HIZB
    .replace("\u06DF", "") //ARABIC SMALL HIGH ROUNDED ZERO
    .replace("\u06E0", "") //ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
    .replace("\u06E1", "") //ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
    .replace("\u06E2", "") //ARABIC SMALL HIGH MEEM ISOLATED FORM
    .replace("\u06E3", "") //ARABIC SMALL LOW SEEN
    .replace("\u06E4", "") //ARABIC SMALL HIGH MADDA
    .replace("\u06E5", "") //ARABIC SMALL WAW
    .replace("\u06E6", "") //ARABIC SMALL YEH
    .replace("\u06E7", "") //ARABIC SMALL HIGH YEH
    .replace("\u06E8", "") //ARABIC SMALL HIGH NOON
    .replace("\u06E9", "") //ARABIC PLACE OF SAJDAH
    .replace("\u06EA", "") //ARABIC EMPTY CENTRE LOW STOP
    .replace("\u06EB", "") //ARABIC EMPTY CENTRE HIGH STOP
    .replace("\u06EC", "") //ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
    .replace("\u06ED", "") //ARABIC SMALL LOW MEEM

val split = str.split(" ")

// Removing a marker leaves two consecutive spaces, which produces an empty entry; skip those
val count = split.count {
    it.isNotBlank()
}
Log.d("count is ", "$count")

Running this snippet through a Kotlin compiler gives a count of 49.
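The code points in the replace() chain form two contiguous ranges (U+0615 to U+061A and U+06D6 to U+06ED), so the same removal could probably be written as a single regular expression. This is a sketch under that assumption; QURANIC_MARKS and countWords are hypothetical names, not part of any library.

```kotlin
// Same code points as the replace() chain: U+0615..U+061A and U+06D6..U+06ED.
private val QURANIC_MARKS = Regex("[\u0615-\u061A\u06D6-\u06ED]")

// Hypothetical helper: strip the marks, split on spaces, ignore empty entries.
fun countWords(text: String): Int =
    text.replace(QURANIC_MARKS, "")
        .split(" ")
        .count { it.isNotBlank() }

fun main() {
    // Made-up sample with one standalone mark between two words.
    println(countWords("ab \u06DA cd"))
}
```

A mark in the middle of a word is removed as well, exactly as with the chain of replace() calls.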

UPDATE:

I have a long string, and I need to color a range inside it with a different color in a TextView. So I split it on spaces, pick the needed words by lower and upper word index, then join them into one string to color their range inside the long string. The above answer did give 49, but it removed the important characters passed to replace. Could you tweak your code to take this into account?

So, if you follow the approach above, you just need to remove the blank entries from the split string. You can do this with filter {} after the markers have been replaced with empty strings:

fun getColorRange(input: String, wordFrom: Int, wordTo: Int): Range<Int> {
    val text = input
        .replace("\u0615", "") //ARABIC SMALL HIGH TAH
        .replace("\u0616", "") //ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
        .replace("\u0617", "") //ARABIC SMALL HIGH ZAIN
        .replace("\u0618", "") //ARABIC SMALL FATHA
        .replace("\u0619", "") //ARABIC SMALL DAMMA
        .replace("\u061A", "") //ARABIC SMALL KASRA
        .replace("\u06D6", "") //ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
        .replace("\u06D7", "") //ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
        .replace("\u06D8", "") //ARABIC SMALL HIGH MEEM INITIAL FORM
        .replace("\u06D9", "") //ARABIC SMALL HIGH LAM ALEF
        .replace("\u06DA", "") //ARABIC SMALL HIGH JEEM
        .replace("\u06DB", "") //ARABIC SMALL HIGH THREE DOTS
        .replace("\u06DC", "") //ARABIC SMALL HIGH SEEN
        .replace("\u06DD", "") //ARABIC END OF AYAH
        .replace("\u06DE", "") //ARABIC START OF RUB EL HIZB
        .replace("\u06DF", "") //ARABIC SMALL HIGH ROUNDED ZERO
        .replace("\u06E0", "") //ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
        .replace("\u06E1", "") //ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
        .replace("\u06E2", "") //ARABIC SMALL HIGH MEEM ISOLATED FORM
        .replace("\u06E3", "") //ARABIC SMALL LOW SEEN
        .replace("\u06E4", "") //ARABIC SMALL HIGH MADDA
        .replace("\u06E5", "") //ARABIC SMALL WAW
        .replace("\u06E6", "") //ARABIC SMALL YEH
        .replace("\u06E7", "") //ARABIC SMALL HIGH YEH
        .replace("\u06E8", "") //ARABIC SMALL HIGH NOON
        .replace("\u06E9", "") //ARABIC PLACE OF SAJDAH
        .replace("\u06EA", "") //ARABIC EMPTY CENTRE LOW STOP
        .replace("\u06EB", "") //ARABIC EMPTY CENTRE HIGH STOP
        .replace("\u06EC", "") //ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
        .replace("\u06ED", "") //ARABIC SMALL LOW MEEM

    val all = text.split(" ").filter { it.isNotBlank() } // Drop the empty entries left where markers were removed
    val sub = (wordFrom..wordTo).map { all[it] }.joinToString(" ")

    Log.d("LOG_TAG", "getColorRange: $sub")
    // Locate the words by their position in text instead of searching for sub
    val words = Regex("\\S+").findAll(text).toList()
    val lower = words[wordFrom].range.first
    val upper = words[wordTo].range.last + 1
    return Range<Int>(lower, upper)
}

Sample usage:

getColorRange(str, 18, 22)

// Output:
//  حَتَّىٰ تَغْتَسِلُوا وَإِنْ كُنْتُمْ مَرْضَىٰ

getColorRange(str, 0, 48) // Covers the entire string, as indices 0 to 48 span all 49 words

// Output:
// يَا أَيُّهَا الَّذِينَ آمَنُوا لَا تَقْرَبُوا الصَّلَاةَ وَأَنْتُمْ سُكَارَىٰ حَتَّىٰ تَعْلَمُوا مَا تَقُولُونَ وَلَا جُنُبًا إِلَّا عَابِرِي سَبِيلٍ حَتَّىٰ تَغْتَسِلُوا وَإِنْ كُنْتُمْ مَرْضَىٰ أَوْ عَلَىٰ سَفَرٍ أَوْ جَاءَ أَحَدٌ مِنْكُمْ مِنَ الْغَائِطِ أَوْ لَامَسْتُمُ النِّسَاءَ فَلَمْ تَجِدُوا مَاءً فَتَيَمَّمُوا صَعِيدًا طَيِّبًا فَامْسَحُوا بِوُجُوهِكُمْ وَأَيْدِيكُمْ إِنَّ اللَّهَ كَانَ عَفُوًّا غَفُورًا

Also notice that there is an issue in the range value. sub is joined with single spaces, but removing a marker leaves two consecutive spaces in text, so the lookup below returns -1 whenever the selected words span a removed marker:

val range = text.indexOf(sub)

Searching for the text can also match an earlier occurrence of the same word (حَتَّىٰ appears twice in this string). Instead, take the offsets of the words themselves, counting only the non-blank runs:

val words = Regex("\\S+").findAll(text).toList()
val lower = words[wordFrom].range.first
val upper = words[wordTo].range.last + 1
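If the markers have to stay in the displayed text, one option might be to leave the string untouched and only skip the standalone markers while counting words, so the offsets refer to the original string. The sketch below assumes exactly that; wordSpan and MARK_ONLY are hypothetical names, and it returns a plain IntRange so that it runs outside Android.

```kotlin
// Same code points as the replace() chain above; matches tokens made only of marks.
private val MARK_ONLY = Regex("[\u0615-\u061A\u06D6-\u06ED]+")

// Hypothetical helper: character offsets, in the ORIGINAL text, that cover
// the words wordFrom..wordTo. Standalone marks are not counted as words.
fun wordSpan(text: String, wordFrom: Int, wordTo: Int): IntRange {
    val words = Regex("\\S+").findAll(text)
        .filterNot { MARK_ONLY.matches(it.value) } // skip tokens that are only a mark
        .toList()
    val start = words[wordFrom].range.first
    val endExclusive = words[wordTo].range.last + 1
    return start until endExclusive
}

fun main() {
    // Made-up sample: the mark between "ab" and "cd" is not counted as a word.
    val sample = "ab \u06DA cd ef"
    val span = wordSpan(sample, 1, 2)
    println(sample.substring(span.first, span.last + 1))
}
```

On Android the two offsets could then be passed on as Range<Int>(span.first, span.last + 1). A marker that sits between the first and last selected word stays inside the span, so it is colored together with them.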