Encoding Intel x86/IA-32 Assembler Instructions

In the post Debugging hello, world, someone asked why the instruction jmp 114 translates into hexadecimal EB12. To answer this, we are going to turn to the “lovely” and venerable Intel Architecture Software Developer’s Manual (IASDM), Volume 2. This volume describes the instruction set of the Intel Architecture processors (x86/IA-32) and the opcode structure. First, I’ll review some of the terms involved:

    x86: The instruction set of the Intel-compatible CPU architectures (chips produced by Intel, AMD, VIA, and others), inaugurated by Intel’s original 16-bit 8086 CPU. Making each new generation of x86 processors almost fully backward compatible proved to be a wise decision.
    IA-32: Intel’s 32-bit implementation of the x86 architecture; the name distinguishes it from the preceding 16-bit x86 processors. Note that when the 64-bit era arrived, Intel launched its Itanium processor, which discards compatibility with the IA-32 instruction set. That 64-bit architecture is referred to as IA-64, meaning “Intel Architecture, 64-bit”. Even though the names are similar, IA-32 and IA-64 are very different architectures and instruction sets. AMD’s response to Intel’s 64-bit processors, however, uses an instruction set that is, in essence, a set of 64-bit extensions to IA-32, i.e., a superset of the x86 instruction set. This instruction set is referred to as AMD64 (initially, x86-64). Later, Intel cloned it under the name Intel 64. AMD’s processors Athlon 64, Turion 64, Opteron, Sempron, etc., are based on AMD64.
    Opcode: An opcode (operation code) is the part of a machine language instruction (pure binary code) that specifies the operation to be performed. The other part of the instruction is the operand, which is optional and represents the data to be operated on. In assembly language, mnemonics are used to represent the opcodes. Concretely, and according to the IASDM, a mnemonic is a reserved name for a class of instruction opcodes which have the same function. For example, in JMP 114, the mnemonic is JMP and the operand is 114 (remember, 114 is in hexadecimal, which is 276 in decimal).

Unlike in high-level languages, there is usually a one-to-one correspondence between basic assembly statements and the binary code of machine language instructions. Nevertheless, in some cases an assembler may provide pseudo-instructions which expand into several machine language instructions to provide commonly needed functionality. Other pseudo-instructions produce no machine instruction at all, such as DB in

db 0d,0a,"hello, world!",0d,0a,"$"

which translates directly into the following sequence of bytes (in hexadecimal):

0D 0A 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 0D 0A 24
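
As a quick check, the expansion of DB can be reproduced in a few lines of Python. This is only a sketch: the helper name db is made up for illustration, and it simply assumes that integers stand for single bytes and strings for ASCII text, which is how the directive is used here.

```python
def db(*items):
    """Mimic the DB directive: ints become single bytes, strings become ASCII bytes."""
    out = bytearray()
    for item in items:
        if isinstance(item, int):
            out.append(item)                  # one byte per numeric item
        else:
            out.extend(item.encode("ascii"))  # one byte per character
    return bytes(out)

data = db(0x0D, 0x0A, "hello, world!", 0x0D, 0x0A, "$")
print(data.hex(" ").upper())  # 0D 0A 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 21 0D 0A 24
print(len(data))              # 18
```

The length, 18 bytes, is the number that matters later when the jump over the string is encoded.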

Therefore, the pseudo-instruction DB only tells the assembler to emit data. Now, for clarity, I’ll repeat the code of Debugging “hello, world” here:

-a 100
CS:0100 jmp 114         ; Jump over the 18 bytes of the string
CS:0102 db 0d,0a,"hello, world!",0d,0a,"$"
CS:0114 mov ah,9       ; Print function
CS:0116 mov dx,102
CS:0119 int 21
CS:011B mov ah, 0      ; Terminate the program
CS:011D int 21
CS:011F
-g =100

Translating the second line is straightforward, as we have just seen. What about jmp 114? Well, we want to jump over the data (18 bytes, one byte per character in the string). The IASDM tells us (Appendix B) that the opcode for a short unconditional jump within the same segment is 11101011, which is EB in hexadecimal. To complete the instruction, we need to provide the operand. Since we want to jump over the string data, our operand is 18 (12 in hexadecimal). That’s why jmp 114 translates into EB12. Note that the operand of this jmp is an 8-bit displacement, not an explicit address: it is counted from the address of the next instruction (102), and 102 + 12 = 114 in hexadecimal.
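
The arithmetic behind EB12 can be sketched in Python. The function name encode_short_jmp is hypothetical, and the sketch assumes the short form of the jump (opcode EB plus a signed 8-bit displacement measured from the end of the 2-byte instruction).

```python
def encode_short_jmp(instr_addr, target):
    """Encode a short JMP: opcode EB followed by a signed 8-bit displacement."""
    next_ip = instr_addr + 2        # the instruction itself occupies 2 bytes
    disp = target - next_ip         # displacement is relative to the next instruction
    if not -128 <= disp <= 127:
        raise ValueError("target out of range for a short jump")
    return bytes([0xEB, disp & 0xFF])  # negative values wrap to two's complement

code = encode_short_jmp(0x100, 0x114)
print(code.hex().upper())  # EB12
```

With the instruction at 100 and the target at 114, the displacement is 114 - 102 = 12 (hexadecimal), which matches the length of the string.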

Translating the other instructions is straightforward; again, we only have to follow the IASDM. Let’s analyze the encoding of mov ah,9 anyway. In this case we have an immediate operand (a constant, 9). When an immediate operand is moved to a register, the encoding takes this form:

1011 w reg : immediate data

Here, w is the operand-size bit. It specifies whether the data is byte-sized or full-sized (where full-sized is either 16 or 32 bits). Since we are using an 8-bit operand, w is set to 0. For its part, reg is a 3-bit sequence identifying the destination register. Table B-3 of the IASDM dictates that if w = 0, then register AH is encoded as binary 100. Thus, the encoding of mov ah,9 is

10110100 00001001

which in hexadecimal is expressed as B409. The next instruction, mov dx,102, follows a similar approach:

1011 1 010 0000 0001 0000 0010

In this case, however, w is set to 1, as the destination DX is a 16-bit register and the operand 102 needs more than one byte of storage. The 3-bit sequence for DX is 010. Needless to say, 0000 0001 0000 0010 is the binary representation of the hexadecimal value 102 (16 bits are required). Expressed in hexadecimal, we would have BA0102. However, the bytes of the operand have to be stored in reverse order, least significant byte first (little-endian), and so the right encoding for the instruction is BA0201.
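
Both mov encodings can be reproduced with a small Python sketch. The function encode_mov_imm and the two lookup tables are names I made up for illustration; the sketch assumes 16-bit code (as in DEBUG), so a full-sized operand is 2 bytes.

```python
# 3-bit register codes for the reg field (8-bit registers when w = 0, 16-bit when w = 1).
REG8 = {"al": 0, "cl": 1, "dl": 2, "bl": 3, "ah": 4, "ch": 5, "dh": 6, "bh": 7}
REG16 = {"ax": 0, "cx": 1, "dx": 2, "bx": 3, "sp": 4, "bp": 5, "si": 6, "di": 7}

def encode_mov_imm(reg, value):
    """Encode 'mov reg, immediate' using the form 1011 w reg : immediate data."""
    if reg in REG8:
        w, code, size = 0, REG8[reg], 1
    elif reg in REG16:
        w, code, size = 1, REG16[reg], 2
    else:
        raise ValueError("unknown register: " + reg)
    opcode = (0b1011 << 4) | (w << 3) | code
    # The immediate is stored least significant byte first (little-endian).
    return bytes([opcode]) + value.to_bytes(size, "little")

print(encode_mov_imm("ah", 9).hex().upper())      # B409
print(encode_mov_imm("dx", 0x102).hex().upper())  # BA0201
```

The byte swap for the 16-bit operand comes entirely from the "little" argument; the opcode byte itself is just the three fields packed together.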

Next, INT n (interrupt type n) is encoded as 1100 1101 : type. Therefore, int 21 is encoded as 1100 1101 0010 0001 (CD21 in hexadecimal). The encoding of mov ah, 0 as B400 follows directly from our previous explanations. Finally, we can translate our little “hello, world!” into binary code directly:

-e 100 EB 12 0D 0A 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64
-e 110 21 0D 0A 24 B4 09 BA 02 01 CD 21 B4 00 CD 21 0D
-g =100
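
As a sanity check, the whole program can be put together byte by byte in Python and compared with what was typed at the -e prompts. This is just a sketch with variable names of my own choosing. The final 0D on the second -e line sits at address 11F, after the last instruction, so it is left out of the comparison.

```python
string = b"\r\nhello, world!\r\n$"   # the 18 data bytes produced by DB

program = (
    bytes([0xEB, len(string)])                        # jmp 114   (EB + displacement 12h)
    + string                                          # db ...
    + bytes([0xB4, 0x09])                             # mov ah,9
    + bytes([0xBA]) + (0x102).to_bytes(2, "little")   # mov dx,102
    + bytes([0xCD, 0x21])                             # int 21
    + bytes([0xB4, 0x00])                             # mov ah,0
    + bytes([0xCD, 0x21])                             # int 21
)

# Bytes from the two -e lines, without the trailing 0D at 11F.
entered = bytes.fromhex(
    "EB 12 0D 0A 68 65 6C 6C 6F 2C 20 77 6F 72 6C 64 "
    "21 0D 0A 24 B4 09 BA 02 01 CD 21 B4 00 CD 21"
)

print(program == entered)         # True
print(hex(0x100 + len(program)))  # 0x11f
```

The second printed value is the first free address after the program, which agrees with the CS:011F prompt shown in the assembly listing.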

And that’s all. I think that my explanations have been clear. But I’m always open to any suggestions and corrections. Thanks for reading.

Meta

2010 – July, 17th

Thanks to BaBax for pointing out an error in the encoding of db 0d,0a,"hello, world!",0d,0a,"$". I had inadvertently included the address of the character “!” in the encoding. Thanks, BaBax, for the correction.
