In my college class we are learning assembly in Visual Studio Enterprise. Recently I was given a task to convert, by hand, letters from basic ASCII to Polish letters (for example a → ą, o → ó, etc.).
I was also given tables of different character encoding systems (which letter stands behind which value). After trying them one after another, it turned out that the encoding is Latin-2.
My question is: is this some sort of default character encoding in assembly, or is it just a visual property of single-byte characters? I suppose that, for example, emojis must be stored as larger values.
Another question: how can I tell which encoding system is used? Do you just have to try out some values and check whether they match what some system encodes?
Code below (copied from my class):
; sample program (32-bit version)
.686
.model flat
extern _ExitProcess@4 : PROC
extern __write : PROC ; (two underscore characters)
public _main
.data
tekst db 10, 'Nazywam sie . . . ' , 10
db 'M',0A2H, 'j pierwszy 32-bitowy program '
db 'asemblerowy dzia', 88H ,'a j',75H,0BEH,' poprawnie!', 10
.code
_main PROC
mov ecx, 85 ; number of characters in the displayed text
; call the "write" function from the C library
push ecx ; number of characters in the displayed text
push dword PTR OFFSET tekst ; address of the area
; holding the characters
push dword PTR 1 ; output device handle
call __write ; display the characters
; (two underscore characters _ )
add esp, 12 ; remove the parameters from the stack
; terminate the program
push dword PTR 0 ; program return code
call _ExitProcess@4
_main ENDP
END
I want to mention that this question is not about making this code work; it already works. It is a purely theoretical question about how I can tell which character encoding is used in assembly.
CodePudding user response:
Personal computers are based on the ASCII code with values 0..127, which can be extended with additional values 128..255. The mapping between those values 0..255 and the glyphs (font) that will be displayed is called a code page.
256 displayable characters are not enough to cover all non-European alphabets, emojis, etc., which is why the Unicode table was established. A character with a value defined in Unicode can be encoded in many ways.
Unix-based systems use the UTF-8 encoding (one character is encoded in one, two, three, or four bytes).
MS Windows uses the UTF-16 encoding (two or four bytes per character) in the WIDE ("W") variants of its functions, and alternatively an ANSI or OEM encoding (one byte per character) in the "A" variants.
Those one-byte variants require a code page to be specified.
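To see the difference concretely, here is a small Python sketch (my illustration, not part of the original program) showing how UTF-8 spends a variable number of bytes per character, while under single-byte encodings the same byte value maps to different glyphs depending on the code page:

```python
# UTF-8 encodes one character in 1..4 bytes, depending on the code point.
for ch in ('a', 'ą', '€', '😀'):
    print(ch, '->', len(ch.encode('utf-8')), 'byte(s)')
# a -> 1, ą -> 2, € -> 3, 😀 -> 4

# A single byte above 127 has no fixed meaning on its own;
# the active code page decides which glyph it represents.
b = bytes([0xA2])
print(b.decode('cp852'))    # 'ó' under OEM code page 852
print(b.decode('latin-1'))  # '¢' under ISO 8859-1
```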
You said that db 'M',0A2H, 'j pierwszy 32-bitowy program '
displays the text correctly, i.e. Mój pierwszy 32-bitowy program. This proves that the value 0A2h
maps to the glyph ó, which corresponds to OEM code page 852 (also known as "DOS Latin 2", probably the "latin2" table you matched; note that the ISO 8859-2 "Latin-2" table maps 0A2h to a different glyph).
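You can verify this mapping outside of assembly. A quick Python check (my sketch, reusing the exact byte values from the db directives above) decodes them with the cp852 codec:

```python
# The raw bytes from the program's `db` directives:
# 0A2h = ó, 88h = ł, 75h = 'u', 0BEh = ż under code page 852.
tekst = (b'M\xa2j pierwszy 32-bitowy program '
         b'asemblerowy dzia\x88a ju\xbe poprawnie!')
print(tekst.decode('cp852'))
# Mój pierwszy 32-bitowy program asemblerowy działa już poprawnie!
```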
Most Windows console functions use an OEM encoding (one byte per character), and the code page can be set in the properties of the console window (or with the chcp command).
How can you tell which encoding is used in assembly? It depends on the method used to write to the screen. The function __write()
apparently uses WriteConsoleA() or WriteFile() internally; this should be documented in its description together with the expected encoding. See also character encoding in asm.
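When the expected encoding is not documented, your trial-and-error approach can be automated: decode the same bytes under several candidate code pages and see which one produces readable Polish. A minimal Python sketch (the candidate list is my own choice, not exhaustive):

```python
# Decode the same byte sequence under several candidate code pages;
# only the right one yields readable Polish text.
sample = b'M\xa2j program dzia\x88a ju\xbe'
for enc in ('cp852', 'cp1250', 'iso8859-2', 'latin-1'):
    try:
        print(f'{enc:>10}: {sample.decode(enc)}')
    except UnicodeDecodeError:
        # e.g. byte 88h is not defined in cp1250
        print(f'{enc:>10}: <byte not defined in this code page>')
```

Only cp852 produces the expected "Mój program działa już"; the other candidates either fail to decode or produce mojibake.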