hello world, C and GNU as


Finally, it’s time to switch to the fabulous GNU as. We’ll forget about DEBUG for some time. Thanks, DEBUG. GNU as, Gas, or the GNU Assembler, is the assembler used by the GNU Project. It is part of the Binutils package, and acts as the default back-end of gcc. Gas is very powerful and can target several computer architectures. Quite a program, then. Like most assemblers, Gas takes input made up of directives (also referred to as pseudo-ops), comments, and, of course, instructions. Instructions depend heavily on the target computer architecture. Directives, by contrast, tend to be relatively homogeneous across targets.

1 Syntax

Originally, this assembler only accepted the AT&T assembler syntax, even for the Intel x86 and x86-64 architectures. The AT&T syntax differs from the one used in most Intel references. There are several differences, the most memorable being that two-operand instructions take their source and destination in the opposite order. For example, the instruction mov ax, bx would be expressed in AT&T syntax as movw %bx, %ax, i.e., the rightmost operand is the destination, and the leftmost one is the source. Another distinction is that register names used as operands must be preceded by a percent (%) sign. Since version 2.10, however, Gas also supports Intel syntax by means of the .intel_syntax directive. In what follows we’ll stick to AT&T syntax.

2 Our Goals

What we’ll be doing is creating a new instance of a hello, world! program. Let’s recapitulate the articles so far. First, we presented some reminiscences and motivations for hello, world!. Next, we coded a hello, world! program by using the MS-DOS DEBUG program. Later, we encoded that program directly in hexadecimal (no need for DEBUG). And finally, we abused the MS-DOS ECHO command to create a binary, executable hello, world! program directly from the DOS command line (again, no need for DEBUG). One thing all these programs had in common was their use of function 09h of INT 21h for printing the “hello, world!” string. But it’s time to move forward. Now I plan to use the lovely C printf function. In C, our greeting program would be

int main()
{
    printf("hello, world!\n");
    return 0;
}

We’ve omitted inclusion of the stdio.h header. We could resort to a single statement, return printf("hello, world!\n") - 14; (printf returns the number of characters written, 14 here), but I think that two statements give clearer code. We save our program in a file called “hello.c”, and compile with

gcc -o hello.exe hello.c

I’ll be working on Windows, with the MinGW port of the GNU Compiler Collection. I like MinGW a lot, especially its ability to provide native functionality via direct Windows API calls, which is good for the performance of our applications. Working in Windows means that our executable files (object code and DLLs too) follow the PE/COFF format. The Portable Executable (PE) file format is a wrapper for all the information the Windows loader requires in order to run the code. PE is a modified version of the Unix COFF file format (hence the name PE/COFF). Another popular file format for executable code is ELF (Executable and Linkable Format), which is used by Linux, the Nintendo Wii and DS, and the PlayStation 3. For the time being, we only have to know that the behavior of GNU as varies according to the target file format (in our case, PE/COFF).

gcc can also provide us with the x86 assembly file it used. I typed gcc -S hello.c and this was the output I got:

.file    "hello.c"
    .def    ___main;    .scl    2;    .type 32;    .endef
    .section .rdata,"dr"
LC0:
    .ascii "hello, world!\12\0"
.text
.globl _main
    .def    _main;    .scl    2;    .type    32;    .endef
_main:
    pushl    %ebp
    movl    %esp, %ebp
    subl    $8, %esp
    andl    $-16, %esp
    movl    $0, %eax
    addl    $15, %eax
    addl    $15, %eax
    shrl    $4, %eax
    sall    $4, %eax
    movl    %eax, -4(%ebp)
    movl    -4(%ebp), %eax
    call    __alloca
    call    ___main
    movl    $LC0, (%esp)
    call    _printf
    movl    $0, %eax
    leave
    ret
    .def    _printf;    .scl    2;    .type    32;    .endef

3 Code Explanations

Broadly speaking, we can identify three kinds of element in the above listing. First, we have directives, whose names begin with a ‘.’ (dot). As mentioned above, directives are typically valid for any target. If the symbol begins with a letter, the statement is an assembly language instruction, i.e., it will assemble into a machine language instruction, and will surely differ between computer architectures. Finally, labels are those symbols immediately followed by a ‘:’ (colon). We may think of labels as “addresses” for data or code. Now let’s do a shallow review of a few germane directives, so bear with me.

.file string

This directive identifies the start of the logical file (and string should be the file name.) Actually, the directive is ignored and is only there for compatibility with old versions. We can remove it.

.def name … .endef

This pair of directives encloses debugging information for the symbol name, and is only observed when Gas is configured for PE/COFF format output. But we don’t need it for a simple hello, world! program.

.section name

This directive indicates that the following code has to be assembled into a section called name. For PE/COFF targets, the .section directive is used in one of the following ways:

.section name [, "flags"]
.section name [, subsegment]

The gcc output we got uses the form with flags. Specifically, two single-character flags indicate the attributes of the section: d (data section) and r (read-only section). But again, we don’t need to explicitly signal section attributes for our simple program.

.ascii "string"

Defines one or more string literals (separated by commas.) Each string is assembled into consecutive addresses (with no trailing zero character.)

.text subsection

Tells Gas to assemble the following statements onto the end of the text subsection numbered subsection. If subsection is omitted (as in our case), subsection number zero is used. We need this directive here because the preceding .section line switched to .rdata; without it, Gas would keep assembling our instructions into that data section rather than into the text (code) section.

.global symbol (or .globl symbol)

.global makes symbol visible to the linker. In our case, we want to inform the linker about the _main function that it is expecting. For compatibility with other assemblers, both spellings (.global or .globl) are valid.

Now, directives are done. After label _main we only have assembly code up to the ret instruction. Some of this code should be clear if you have previous experience with assembly programming. Nevertheless, let’s review these instructions too. Note that the ‘l’ on the end of each mnemonic tells Gas that we want to use the version of the instruction that works with “long” (32-bit) operands.

The first three instructions are the typical function prologue, which sets up the stack frame:

pushl   %ebp
movl    %esp, %ebp
subl    $8, %esp

By subtracting 8 bytes from ESP we’re reserving the space on the stack to hold local variables (the Intel stack “grows” from high memory locations to the lower ones.) Next we have the rarer

andl	$-16, %esp

Remember that in hexadecimal, -16 is expressed as 0xFFFFFFF0. Therefore, this and instruction clears the low four bits of ESP, rounding it down to the nearest 16-byte boundary. The reasons for this alignment are not very clear to me. It may be a gcc choice in order to accelerate floating point accesses, or it may be for compatibility with a particular architecture. Either way, we don’t require such alignment for displaying hello, world!

The following code is mostly a very contrived way of storing a value in EAX:

movl	$0, %eax
addl	$15, %eax
addl	$15, %eax
shrl	$4, %eax
sall	$4, %eax
movl	%eax, -4(%ebp)
movl	-4(%ebp), %eax

Clearly the code is not optimized, as there are a lot of unnecessary lines. Moreover, the final value of EAX (16, once the arithmetic is worked out) is also stored in the memory previously reserved on the stack. It seems the value in EAX is a parameter for the _alloca invocation in the following two lines:

call	__alloca
call	___main

These two calls are unnecessary for our toy application. We won’t delve into details, but I’ll just say that alloca() is a function used to allocate memory on the stack. And if PE/COFF binaries are used, and our application has an int main() function, then a function void __main() should be called first thing after entering main(). We’ll leave it at that for now. More information can be found in this excellent and instructive article from OSDevWiki.

At last, we find the useful code

movl    $LC0, (%esp)
call    _printf

It moves the address of the ASCII string onto the stack, and invokes printf. Now, where’s the definition of printf? Well, we’ll take it from the C library, of course. The linker (ld) is responsible for associating our code with the definition of printf.

Finally, we find

movl    $0, %eax
leave
ret

These instructions constitute the “returning code.” They store the return value (0 == success!) in EAX, tear down the stack frame, and pop the saved Instruction Pointer from the stack in order to return control to the calling procedure or program.

If we strip all the unnecessary lines, our hello, world! would acquire this form:

.data
LC0:
    .ascii "hello, world!\n\0"
.text
    .global _main
_main:
    pushl    %ebp
    movl    %esp, %ebp
    subl    $4, %esp
    movl    $LC0, (%esp)
    call    _printf
    movl    $0, %eax
    leave
    ret

Shorter and clearer. I assembled and linked it the hard way, step by step:

as -o hello.o hello.s

ld -o hello.exe
/mingw/lib/crt2.o
C:/MinGW/bin/../lib/gcc/mingw32/3.4.5/crtbegin.o
-LC:/MinGW/bin/../lib/gcc/mingw32/3.4.5
-LC:/MinGW/lib hello.o
-lmingw32 -lgcc -lmsvcrt -lkernel32
C:/MinGW/bin/../lib/gcc/mingw32/3.4.5/crtend.o

But it’s better to just type gcc -o hello.exe hello.s :-)
