Encoding Intel x86/IA-32 Assembler Instructions

In the post “Debugging hello, world”, someone asked why the instruction jmp 114 translates into hexadecimal EB12. To answer this, we turn to the “lovely” and venerable Intel Architecture Software Developer’s Manual (IASDM), Volume 2. This volume describes the instruction set of the Intel Architecture processors (x86/IA-32) and the opcode structure. Let me first review some of the terms involved:

    x86: The instruction set of the Intel-compatible CPU architectures (chips produced by Intel, AMD, VIA, and others) inaugurated by Intel’s original 16-bit 8086 CPU. Making each new generation of x86 processors almost fully backwards compatible proved to be a wise decision.
    IA-32: Intel’s 32-bit implementation of the x86 architecture; the name distinguishes it from the preceding 16-bit x86 processors. Note that when the 64-bit era arrived, Intel launched its Itanium processor, which discards compatibility with the IA-32 instruction set. That 64-bit architecture is referred to as IA-64, meaning “Intel Architecture, 64-bit”, but even though the names are similar, IA-32 and IA-64 are very different architectures and instruction sets. AMD’s response to Intel’s 64-bit processors, however, uses an instruction set that is essentially a set of 64-bit extensions to IA-32, i.e., a superset of the x86 instruction set. This instruction set is referred to as AMD64 (initially, x86-64). Later, Intel cloned it under the name Intel 64. AMD’s processors Athlon 64, Turion, Opteron, Sempron, etc., are based on AMD64.
    Opcode: An opcode (operation code) is the part of a machine language instruction (pure binary code) that specifies the operation to be performed. The other portion of the instruction is the operand, which is optional and represents the data to be operated on. In assembly language, mnemonics are used to represent the opcodes. Concretely, and according to the IASDM, a mnemonic is a reserved name for a class of instruction opcodes which have the same function. For example, in JMP 114, the mnemonic is JMP and the operand is 114 (remember, 114 is hexadecimal here, i.e., 276 in decimal).

Unlike in high-level languages, there is usually a one-to-one correspondence between basic assembly statements and the binary code of machine language instructions. Nevertheless, in some cases an assembler may provide pseudo-instructions which expand into several machine language instructions to provide commonly needed functionality. Others produce no instruction at all, such as DB in

db 0d,0a,"hello, world!",0d,0a,"$"

which directly translates into the sequence of characters (in hexadecimal):

0D 0A 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 0D 0A 24
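The byte sequence above can be double-checked with a few lines of Python. This is only a sketch for illustration: Python and the helper name db_bytes are my own choices, not something DEBUG or the IASDM provides, and it assumes plain ASCII for the quoted text.

```python
def db_bytes(*items):
    """Flatten DB-style operands (integers and quoted strings) into bytes.

    Hypothetical helper: integers are emitted as single bytes, strings as
    their ASCII codes, in the order given.
    """
    out = bytearray()
    for item in items:
        if isinstance(item, str):
            out.extend(item.encode("ascii"))
        else:
            out.append(item)
    return bytes(out)


# Same operands as: db 0d,0a,"hello, world!",0d,0a,"$"
data = db_bytes(0x0D, 0x0A, "hello, world!", 0x0D, 0x0A, "$")

print(data.hex(" ").upper())  # the bytes, space separated, in hexadecimal
print(len(data))              # number of bytes the jump has to skip
```

The length it reports is the 18 bytes that the jmp instruction later has to skip.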

Therefore, the DB pseudo-instruction only tells the assembler where to place raw data. Now, for clarity, I’ll repeat the code of “Debugging hello, world” here:

- a 100
CS:0100 jmp 114         ; Jump over the 18 bytes of the string
CS:0102 db 0d,0a,"hello, world!",0d,0a,"$"
CS:0114 mov ah,9       ; Print function
CS:0116 mov dx,102
CS:0119 int 21
CS:011B mov ah, 0      ; Terminate the program
CS:011D int 21
CS:011F
-g =100

Translating the second line (the DB data) is already solved. What about jmp 114? We want to jump over the data (18 bytes, one byte for each character in the string). The IASDM tells us (Appendix B) that the opcode for a short unconditional jump within the same segment is 11101011, which is EB in hexadecimal. We still need to provide the operand to complete the instruction. Since we want to skip the string data, the operand is 18 (12 in hexadecimal). That’s why jmp 114 translates into EB12. Note that the operand of this jmp is an 8-bit displacement, not an explicit address: it is counted from the address of the next instruction (102), and 102 + 12 = 114 in hexadecimal.
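The displacement arithmetic can be sketched in Python as follows. The function name encode_short_jmp is hypothetical, and the sketch only covers the two-byte short form (opcode EB followed by a signed 8-bit displacement) discussed here.

```python
def encode_short_jmp(address, target):
    """Encode a short JMP located at `address` that lands on `target`.

    Hypothetical helper. The displacement is relative to the address of
    the next instruction, i.e. address + 2 for this two-byte encoding.
    """
    size = 2  # one opcode byte + one displacement byte
    disp = target - (address + size)
    if not -128 <= disp <= 127:
        raise ValueError("target out of range for a short jump")
    # disp & 0xFF gives the two's complement byte for negative values
    return bytes([0xEB, disp & 0xFF])


code = encode_short_jmp(0x100, 0x114)
print(code.hex(" ").upper())
```

For the jump at 100 with target 114 the displacement is 114 - 102 = 12 (hexadecimal), which gives the EB 12 pair from the text.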

Translation of the other instructions is straightforward, and again we only have to follow the IASDM. Let’s analyze the encoding of mov ah,9 anyway. Here we have an immediate operand (a constant, 9). For moving an immediate operand to a register, the encoding takes this form:

1011 w reg : immediate data

Here, w is the operand-size bit. It specifies whether the data is a byte or full-sized (where full-sized is either 16 or 32 bits). As we’ll be using an 8-bit operand, we set the bit to 0. For its part, reg is a 3-bit sequence identifying the destination register. Table B-3 of the IASDM dictates that if w = 0, then register AH is encoded as binary 100. Thus, the encoding of mov ah,9 is

10110100 00001001

which in hexadecimal is expressed as B409. The next instruction, mov dx,102, follows a similar approach:

1011 1 010 0000 0001 0000 0010

In this case, however, w is set to 1, as DX is a 16-bit register and the operand 102 needs more than one byte of storage. The 3-bit sequence for DX is 010. Needless to say, 0000 0001 0000 0010 is the binary representation of the hexadecimal value 102 (16 bits are required). Expressed in hexadecimal, we would have BA0102. However, the bytes of the operand have to be stored in reverse order (least significant byte first, since x86 is little-endian), and therefore the right encoding for the instruction is BA0201.
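Both mov encodings can be reproduced with a small Python sketch. The name encode_mov_imm and the two lookup tables are my own, written for illustration; the register codes are the standard 3-bit x86 ones, and the sketch assumes 16-bit operand size for w = 1, as in DEBUG.

```python
# 3-bit register codes when w = 0 (8-bit registers)
REG8 = {"al": 0, "cl": 1, "dl": 2, "bl": 3, "ah": 4, "ch": 5, "dh": 6, "bh": 7}
# 3-bit register codes when w = 1 (16-bit registers, assumed operand size)
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}


def encode_mov_imm(reg, value):
    """Encode `mov reg, imm` using the form 1011 w reg : immediate data."""
    if reg in REG8:
        w, code, width = 0, REG8[reg], 1
    elif reg in REG16:
        w, code, width = 1, REG16[reg], 2
    else:
        raise ValueError(f"unknown register: {reg}")
    opcode = 0b1011_0000 | (w << 3) | code
    # The immediate is stored least significant byte first (little-endian)
    return bytes([opcode]) + value.to_bytes(width, "little")


print(encode_mov_imm("ah", 9).hex(" ").upper())
print(encode_mov_imm("dx", 0x102).hex(" ").upper())
```

The first call yields B4 09 and the second BA 02 01, matching the encodings derived by hand, including the reversed byte order of the 16-bit operand.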

Next, INT n (interrupt type n) is encoded as 1100 1101 : type. Therefore, int 21 is encoded as 1100 1101 0010 0001 (CD21 in hexadecimal). The encoding of mov ah, 0 as B400 follows directly from our previous explanations. Finally, we can translate our little “hello, world!” into binary code directly:

-e 100 EB 12 0D 0A 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64
-e 110 21 0D 0A 24 B4 09 BA 02 01 CD 21 B4 00 CD 21 0D
-g =100
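As a cross-check, the whole program can be put together from the individual encodings and printed in the same shape as the DEBUG enter commands. This is a Python sketch of my own; debug_enter_lines is a hypothetical helper, not a DEBUG feature.

```python
# Each piece is the encoding derived in the text, in program order.
program = b"".join([
    bytes([0xEB, 0x12]),                            # jmp 114
    b"\r\nhello, world!\r\n$",                      # db 0d,0a,"hello, world!",0d,0a,"$"
    bytes([0xB4, 0x09]),                            # mov ah,9
    bytes([0xBA]) + (0x102).to_bytes(2, "little"),  # mov dx,102
    bytes([0xCD, 0x21]),                            # int 21
    bytes([0xB4, 0x00]),                            # mov ah,0
    bytes([0xCD, 0x21]),                            # int 21
])


def debug_enter_lines(code, origin=0x100, width=16):
    """Format `code` as DEBUG-style 'e' commands, `width` bytes per line."""
    lines = []
    for offset in range(0, len(code), width):
        chunk = code[offset:offset + width]
        lines.append(f"-e {origin + offset:X} " + chunk.hex(" ").upper())
    return lines


for line in debug_enter_lines(program):
    print(line)
print(len(program))
```

The program itself is 31 bytes long, ending at address 11E, which agrees with the assembler prompt stopping at CS:011F. The second e line in the listing above enters one more byte (0D at 11F); as far as I can tell it is never reached, because the final int 21 terminates the program first.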

And that’s all. I think that my explanations have been clear. But I’m always open to any suggestions and corrections. Thanks for reading.

Meta

2010 – July, 17th

Thanks to BaBax for pointing out an error in the encoding of db 0d,0a,"hello, world!",0d,0a,"$". I had inadvertently included the address of the character “!” in the encoding. Thanks, BaBax, for the correction.

20 thoughts on “Encoding Intel x86/IA-32 Assembler Instructions”

  1. I think that assembly is not my thing anymore… And in the remote chance I code in assembly again, I’d go for RISC architectures.

  2. Thanks for the post… I didn’t know the IA32 – IA64 thing. But now the translation from assembly to binary code seems pretty straightforward to me.

    Heck, perhaps we could now code directly in binary!

    🙂

  3. @Carlos_Vasquez: “And in the remote chance I code in assembly again”

    Tell that to my hardware architecture professor!

  4. So far, so good. Your article is very clear, and now the doubts about JMP 114 translation should be out.

    However, please, please, don’t forget to change the layout colors. Those ‘grayed’ assembly comments are killing my eyes.

  5. …it’s been a while since I programmed in assembly, but the post brings back good memories… and the explanation is very well done

    in practice, we can open a text file and write the characters in hexadecimal, change the extension to an executable one, and we should have a working program, with no assembler or compiler…

    now, I wouldn’t spend any more time on MS-DEBUG, and I’d go for something much better like Gas (GNU as)…

  6. Thanks everybody for dropping by, and thanks for your comments.

    @El_Hombre_Que_Programaba: Although I’m not sure yet, I doubt the next 2 or 3 articles will be about assembly. And most of what the article covers is general in nature (not restricted to MS-DEBUG). The reference to MS-DEBUG comes from the previous post.

  7. @El_Hombre_Que_Programaba: How will you get the final 0D into the text file?

  8. Wow, now tell us about coding in pure binary. Who needs C++, Python, Perl and such inefficient things?

  9. It’s ok! Without the knowledge of assembler mnemonic it is
    hard to understand what a “high-tech computer” is doing when
    it is booted.
    Good explanation!

    By the way, there is an error in the part

    ———————————————————-
    “which directly translates into the sequence of characters
    (in hexadecimal):

    0D 0A 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 110 21 0D 0A 24”
    ———————————————————-

    The number 110 is the address of the byte 0x21 (the
    character “!”) and doesn’t belong to the string.

    The 0x110 is out of range of standard ASCII and the whole
    string would look like “hello, worldÉ!” if you are using
    Extended ASCII Codes. Furthermore, with the 0x110 the
    whole string would be 19 characters long (as shown) and not
    18.

  10. You’re right, BaBax. I’m sorry for the mistake. I have corrected the encoding, and included a note recognizing your contribution.

    Thank you very much.
