I have a somewhat unusual problem. I am currently trying to program a chat filter for Discord in Java 16.
Here I ran into the problem that in German there are several ways to write a word to get around this filter.
As an example I now take the insult "Hurensohn". Now you could simply write "Huränsohn" or "Hur3nsohn" in the chat and thus bypass the filter quite easily.
Since I don't want to manually add every possibility to the filter, I thought about how I could do it automatically. So the first thing I did was to create a HashMap with all possible alternative letters, which looked something like this:
Map<String, List<String>> alternativeCharacters = new HashMap<>();
alternativeCharacters.put( "E", List.of( "ä", "3" ) );
I tried to change the corresponding letters in the words and add them to the chat filter, which actually worked.
But now we come to the problem: To be able to cover all possible combinations, it doesn't do me much good to change only one type of letter in a word.
If we now take the word "Einschalter" and change the letter "e", we could replace the "e" with a "3" or with an "ä", which would produce the following:
- 3inschalt3r
- Einschalt3r
- 3inschalter
and
- Äinschaltär
- Einschaltär
- Äinschalter
But now I also want "mixed" words to be created, e.g. "3inschaltär", where both the "Ä" and the "3" are used in the same word. That would produce the following combinations:
- 3inschaltär
- Äinschalt3r
Does anyone know how I can realize something like that? With the normal replace() method I haven't yet found a way to create "mixed" replacements.
I hope people understand what kind of problem I have and what I want to do. :D
Current method used for replacing:
import java.util.ArrayList;
import java.util.List;

public static List<String> replace( String word, String from, String... to ) {
    final int[] index = { 0 };
    List<String> strings = new ArrayList<>();
    /* Replaces all letters */
    List.of( to ).forEach( value -> strings.add( word.replaceAll( from, value ) ) );
    /* Here is the problem. Here only one letter is edited at a time and thus changed in the word */
    List.of( to ).forEach( value -> {
        List.of( word.split( "" ) ).forEach( letters -> {
            if ( letters.equalsIgnoreCase( from ) ) {
                strings.add( word.substring( 0, index[0] ) + value + "" + word.substring( index[0] + 1 ) );
            }
            index[0]++;
        } );
        index[0] = 0;
    } );
    return strings;
}
CodePudding user response:
As said by others, you can’t keep up with the creativity of people. But if you want to continue using such a check, you should use the right tool for the job, i.e. a RuleBasedCollator.
RuleBasedCollator c = new RuleBasedCollator("<i,I=1=!<e=ä,E=3=Ä<o=0,O");
c.setStrength(Collator.PRIMARY);
String a = "3inschaltär", b = "Einschalter";
if(c.compare(a, b) == 0) {
    System.out.println(a + " matches " + b);
}
3inschaltär matches Einschalter
This class even allows efficient hash lookups:
// using c from above
// prepare map
var map = new HashMap<CollationKey, String>();
for(String s: List.of("Einschalter", "Hicks-Boson")) {
    map.put(c.getCollationKey(s), s);
}
// use map for lookup
for(String s: List.of("Ä!nschalt3r", "H1cks-B0sOn")) {
    System.out.println(s);
    String match = map.get(c.getCollationKey(s));
    if(match != null) System.out.println("\ta variant of " + match);
}
Ä!nschalt3r
	a variant of Einschalter
H1cks-B0sOn
	a variant of Hicks-Boson
While a Collator can be used for sorting, you’re only interested in identifying equal strings. Therefore, I didn’t care to specify a useful order, which simplifies the rules, as we only need to specify the characters supposed to be equal.
The linked documentation explains the syntax; in short, I=1=! defines the characters I, 1, and ! as equal, whereas prepending i, defines i to be a different case of the other characters. Likewise, e=ä,E=3=Ä defines e equal to ä and both being a different case than the characters E, 3, Ä. Finally, the < separator defines characters to be different. It also defines a sorting order which, as said, we don’t care about in this usage.
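To make the effect of the separators and of setStrength more tangible, here is a minimal sketch. The class name CollatorStrengthDemo and the shortened, ASCII-only rule string are my own choices for illustration, not part of the rules used above. As far as I understand the documentation, "," marks a tertiary (case-like) difference, which a PRIMARY-strength collator ignores, while "=" marks no difference at all.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

class CollatorStrengthDemo {
    // Shortened, ASCII-only variant of the rules above (an assumption for this sketch).
    static final String RULES = "<i,I=1<e,E=3<o=0,O";

    // Builds a collator for the rules with the requested strength.
    static RuleBasedCollator collator(int strength) {
        try {
            RuleBasedCollator c = new RuleBasedCollator(RULES);
            c.setStrength(strength);
            return c;
        } catch (ParseException e) {
            throw new IllegalStateException("invalid rules: " + RULES, e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator primary = collator(Collator.PRIMARY);
        RuleBasedCollator tertiary = collator(Collator.TERTIARY);

        // "e" and "E" are separated by ",": equal only when case-like differences are ignored.
        System.out.println("PRIMARY  e vs E: " + primary.compare("e", "E"));
        System.out.println("TERTIARY e vs E: " + tertiary.compare("e", "E"));

        // "E" and "3" are joined by "=": no difference at either strength.
        System.out.println("TERTIARY E vs 3: " + tertiary.compare("E", "3"));

        // "e" and "o" are separated by "<": always different.
        System.out.println("PRIMARY  e vs o: " + primary.compare("e", "o"));
    }
}
```

A compare result of 0 means the two strings are treated as equal at that strength. This is why the PRIMARY setting is the one you want for the filter.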
As an addendum, the following can be used to remove accents and other markings from characters, except for umlauts, as you want to match German words. This removes the need to deal with an exploding number of obfuscated character combinations, especially from people who know about Zalgo text converters:
String s = "òñę ảëîöū";
String n = Normalizer.normalize(s, Normalizer.Form.NFD)
    .replaceAll("(?!(?<=[aou])\u0308)\\p{Mn}", "");
System.out.println(s + " -> " + n);
òñę ảëîöū -> one aeiöu
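Putting the pieces of this answer together, a rough sketch of a per-word check for a whole chat message could look like the following. The class WordFilterSketch, its method names and the reduced ASCII-only rule string are hypothetical and only meant as illustration. The sketch assumes that splitting on whitespace is good enough, so a word with attached punctuation (e.g. "3inschalt3r,") would slip through unless you strip punctuation first.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.text.Normalizer;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class WordFilterSketch {
    private final RuleBasedCollator collator = createCollator();
    private final Map<CollationKey, String> blocked = new HashMap<>();

    WordFilterSketch(List<String> blockedWords) {
        // Index every blocked word by its collation key, as in the lookup example above.
        for (String word : blockedWords) {
            blocked.put(collator.getCollationKey(word), word);
        }
    }

    // Reduced rule set for this sketch; extend it with your own look-alike characters.
    private static RuleBasedCollator createCollator() {
        try {
            RuleBasedCollator c = new RuleBasedCollator("<i,I=1<e,E=3<o=0,O");
            c.setStrength(Collator.PRIMARY);
            return c;
        } catch (ParseException e) {
            throw new IllegalStateException(e);
        }
    }

    // Same normalization as above: drop combining marks, but keep the umlaut dots on a, o, u.
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("(?!(?<=[aou])\u0308)\\p{Mn}", "");
    }

    // Returns the blocked word that the first offending token is a variant of, if any.
    Optional<String> firstBlockedWord(String message) {
        for (String token : stripMarks(message).split("\\s+")) {
            String match = blocked.get(collator.getCollationKey(token));
            if (match != null) {
                return Optional.of(match);
            }
        }
        return Optional.empty();
    }
}
```

Usage would be along the lines of new WordFilterSketch(List.of("Einschalter")).firstBlockedWord(message), which yields a non-empty Optional when one of the words in the message is a variant of a blocked word.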
CodePudding user response:
Off the top of my head, you may try to approach this using regular expressions, compiling patterns by replacing the respective letters where multiple ways of writing may occur in your dictionary.
E.g. in the direction of:
record LetterReplacements(String letter, List<String> replacements){}
public Predicate<String> generatePredicateForDictionaryWord(String word){
    var letterA = new LetterReplacements("a", List.of("a", "A", "4"));
    // builds an alternation such as (a|A|4)
    var writingStyles = letterA.replacements().stream()
        .collect(Collectors.joining("|", "(", ")"));
    var pattern = word.replaceAll(letterA.letter(), writingStyles);
    return Pattern.compile(pattern).asPredicate();
}
Example usage:
@ParameterizedTest
@CsvSource({
    "maus,true",
    "m4us,true",
    "mAus,true",
    "mous,false"
})
void testDictionaryPredicates(String word, boolean expectedResult) {
    var predicate = underTest.generatePredicateForDictionaryWord("maus");
    assertThat(predicate.test(word)).isEqualTo(expectedResult);
}
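To extend the idea to several letters at once, a sketch along these lines might do. The class VariantPatterns and the contents of the ALTERNATIVES table are made up for illustration. Since every letter gets its own alternation, "mixed" spellings such as "3inschaltär" are covered by a single pattern without generating each variant. Unlike the snippet above, it uses asMatchPredicate() (Java 11+), so the whole input has to match.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

class VariantPatterns {
    // Hypothetical substitution table: base letter -> look-alike spellings.
    static final Map<Character, List<String>> ALTERNATIVES = Map.of(
            'a', List.of("4"),
            'e', List.of("3", "\u00e4"),   // \u00e4 is the umlaut ä
            'i', List.of("1", "!"),
            'o', List.of("0"));

    // Builds one regex per dictionary word, with an alternation for every letter.
    static Predicate<String> predicateFor(String word) {
        StringBuilder regex = new StringBuilder();
        for (char c : word.toLowerCase(Locale.ROOT).toCharArray()) {
            regex.append("(?:").append(Pattern.quote(String.valueOf(c)));
            for (String alternative : ALTERNATIVES.getOrDefault(c, List.of())) {
                regex.append('|').append(Pattern.quote(alternative));
            }
            regex.append(')');
        }
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        return Pattern.compile(regex.toString(), flags).asMatchPredicate();
    }

    public static void main(String[] args) {
        Predicate<String> einschalter = predicateFor("Einschalter");
        System.out.println(einschalter.test("3inschalt\u00e4r"));
        System.out.println(einschalter.test("Ausschalter"));
    }
}
```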
However, I doubt that any approach in this direction would perform well enough, especially since I expect your dictionary to grow rather fast and the number of different writing "styles" to be rather large.
So please regard the snippet above only as an explanation of the approach I was talking about. Again, I doubt you would achieve sufficient performance, even when precompiling all patterns and predicate combinations beforehand.