I have the following C source file, with some asm blocks that implement a print and exit routine by calling DOS system calls.

__asm__(
    ".code16gcc;"
    "call dosmain;"
    "mov $0x4C, %AH;"
    "int $0x21;"
);

void print(char *str)
{
    __asm__(
        "mov $0x09, %%ah;"
        "int $0x21;"
        : // no output
        : "d"(str)
        : "ah"
    );
}

void dosmain()
{
    // DOS system call expects strings to be terminated by $.
    print("Hello world$");
}

The linker script and the build command are as follows:

OUTPUT_FORMAT(binary)
SECTIONS
{
    . = 0x0100;
    .text :
    {
        *(.text);
    }
    .data :
    {
        *(.data);
        *(.bss);
        *(.rodata);
    }
    _heap = ALIGN(4);
}

gcc -fno-pie -Os -nostdlib -ffreestanding -m16 -march=i386 \
-Wl,--nmagic,--script=simple_dos.ld simple_dos.c -o simple_dos.com

I am used to building .COM files in assembly, and I am aware of the structure of a DOS .COM file. However, the .COM file generated using GCC has some extra bytes at the end, and I cannot figure out why. (The bytes inside the shaded area and the box below are what is expected; everything else is unaccounted for.)

[image: dump of the generated file's bytes]

My hunch is that this is some static storage used by GCC. I thought it could be due to the string in the program, so I commented out the line print("Hello world$"); but the extra bytes are still there. It would be of great help if someone knows what is going on and can tell me how to prevent GCC from inserting these bytes in the output.

Source code is available here: Github

PS: Object file also contains these extra bytes.

Michael Petch
Arjob Mukherjee
  • Side note: Your print function isn't safe. Doesn't `int $0x21` return in AL? But you didn't tell the compiler about that, only AH. Better tell it the whole EAX is clobbered. – Peter Cordes May 29 '19 at 20:50
  • @PeterCordes That is right. Thanks. Do you have any idea why the extra bytes? – Arjob Mukherjee May 29 '19 at 21:04
  • No idea, I'm not that familiar with linker scripts. Did you check the `.o` to see if those bytes are present in the object file? I notice you didn't specify a `--oformat` option, though, so you haven't told LD to make a flat binary with no metadata. – Peter Cordes May 29 '19 at 21:15
  • @PeterCordes Yes they are in the object file as well. – Arjob Mukherjee May 29 '19 at 21:21
  • I wouldn't use `gcc` to link the executable and instead just invoke `ld` directly. That way you don't have to worry about `gcc` adding any extra object files or libraries. – Ross Ridge May 29 '19 at 23:17
  • @RossRidge: `-nostdlib` leaves out *everything* including `libgcc`. It might not imply `-static` without `-no-pie`, though. Anyway, worth trying to remove possible but unlikely causes, but unless it turns out to be the problem `gcc -nostdlib` should be fine in general, if you won't mind passing all your linker options via `-Wl` – Peter Cordes May 30 '19 at 10:51

2 Answers

Because you are using a native compiler rather than an i686 (or i386) cross compiler, GCC may emit a fair amount of extra data and sections. Exactly what you get depends on how the compiler was configured. I recommend the following to remove unwanted code generation and sections:

  • Use GCC option -fno-asynchronous-unwind-tables to eliminate any .eh_frame sections. This is the cause of the unwanted data appended at the end of your DOS COM program in this case.
  • Use GCC option -static to build without relocations to avoid any form of dynamic linking.
  • Have GCC pass the --build-id=none option to the linker with -Wl to avoid unnecessarily generating any .note.gnu.build-id sections.
  • Modify the linker script to DISCARD any .comment sections.

Your build command could look like:

gcc -fno-pie -static -Os -nostdlib -fno-asynchronous-unwind-tables -ffreestanding \
-m16 -march=i386 -Wl,--build-id=none,--nmagic,--script=simple_dos.ld simple_dos.c \
-o simple_dos.com

I would modify your linker script to look like:

OUTPUT_FORMAT(binary)
SECTIONS
{
    . = 0x0100;
    .text :
    {
        *(.text*);
    }
    .data :
    {
        *(.data);
        *(.rodata*);
        *(.bss);
        *(COMMON)
    }
    _heap = ALIGN(4);

    /DISCARD/ : { *(.comment); }
}

Besides adding a /DISCARD/ directive to eliminate any .comment sections, I also added *(COMMON) alongside .bss. Both are BSS sections. I moved them after the data sections because they won't take up space in the .COM file if they appear after the other sections. I also changed *(.rodata); to *(.rodata*); and *(.text); to *(.text*); because GCC can generate section names that begin with .rodata and .text but carry different suffixes.
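
To illustrate why the trailing `*` matters, here is a minimal sketch in C of the prefix matching that a pattern such as `.rodata*` performs. It only models the trailing-wildcard case and is not the linker's actual matching code; the helper names `matches_section_prefix` and `matches_section_exact` are hypothetical. The name `.rodata.str1.1` is the section GCC emitted for the string literal in the disassembly shown later in this answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: models a linker-script pattern of the form "prefix*".
 * Returns true when `section` starts with `prefix`. */
static bool matches_section_prefix(const char *section, const char *prefix)
{
    assert(section != NULL && prefix != NULL);
    return strncmp(section, prefix, strlen(prefix)) == 0;
}

/* Hypothetical helper: models an exact pattern such as *(.rodata),
 * which has no wildcard and therefore matches only that one name. */
static bool matches_section_exact(const char *section, const char *name)
{
    assert(section != NULL && name != NULL);
    return strcmp(section, name) == 0;
}
```

With these helpers, `.rodata.str1.1` matches the prefix `.rodata` but not the exact name `.rodata`, which is why an input section with a suffix is not picked up by `*(.rodata)`.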


Inline Assembly

This is unrelated to the problem you asked about, but it is important. In this inline assembly:

__asm__(
    "mov $0x09, %%ah;"
    "int $0x21;"
    : // no output
    : "d"(str)
    : "ah"
);

Int 21h/AH=9h also clobbers AL. You should use ax as the clobber.

Since you are passing the address of an array through a register, you will also want to add a "memory" clobber so that the compiler materializes the entire array in memory before your inline assembly is emitted. The constraint "d"(str) only tells the compiler that you are using the pointer as an input, not the data the pointer points at.

If you compiled with optimisations at -O3, you would probably discover that the following version of the program doesn't even have your string "Hello world$" in it because of this bug:

__asm__(
        ".code16gcc;"
        "call dosmain;"
        "mov $0x4C, %AH;"
        "int $0x21;"
);

void print(char *str)
{
        __asm__(
                "mov $0x09, %%ah;"
                "int $0x21;"
                : // no output
                : "d"(str)
                : "ax");
}

void dosmain()
{
        char hello[] = "Hello world$";
        print(hello);
}

The generated code for dosmain allocates space on the stack for the string but never stores the string there before printing it:

00000100 <print-0xc>:
 100:   66 e8 12 00 00 00       calll  118 <dosmain>
 106:   b4 4c                   mov    $0x4c,%ah
 108:   cd 21                   int    $0x21
 10a:   66 90                   xchg   %eax,%eax

0000010c <print>:
 10c:   67 66 8b 54 24 04       mov    0x4(%esp),%edx
 112:   b4 09                   mov    $0x9,%ah
 114:   cd 21                   int    $0x21
 116:   66 c3                   retl

00000118 <dosmain>:
 118:   66 83 ec 10             sub    $0x10,%esp
 11c:   67 66 8d 54 24 03       lea    0x3(%esp),%edx
 122:   b4 09                   mov    $0x9,%ah
 124:   cd 21                   int    $0x21
 126:   66 83 c4 10             add    $0x10,%esp
 12a:   66 c3                   retl

If you change the inline assembly to include a "memory" clobber like this:

void print(char *str)
{
        __asm__(
                "mov $0x09, %%ah;"
                "int $0x21;"
                : // no output
                : "d"(str)
                : "ax", "memory");
}

The generated code may look similar to this:

00000100 <print-0xc>:
 100:   66 e8 12 00 00 00       calll  118 <dosmain>
 106:   b4 4c                   mov    $0x4c,%ah
 108:   cd 21                   int    $0x21
 10a:   66 90                   xchg   %eax,%eax

0000010c <print>:
 10c:   67 66 8b 54 24 04       mov    0x4(%esp),%edx
 112:   b4 09                   mov    $0x9,%ah
 114:   cd 21                   int    $0x21
 116:   66 c3                   retl

00000118 <dosmain>:
 118:   66 57                   push   %edi
 11a:   66 56                   push   %esi
 11c:   66 83 ec 10             sub    $0x10,%esp
 120:   67 66 8d 7c 24 03       lea    0x3(%esp),%edi
 126:   66 be 48 01 00 00       mov    $0x148,%esi
 12c:   66 b9 0d 00 00 00       mov    $0xd,%ecx
 132:   f3 a4                   rep movsb %ds:(%si),%es:(%di)
 134:   67 66 8d 54 24 03       lea    0x3(%esp),%edx
 13a:   b4 09                   mov    $0x9,%ah
 13c:   cd 21                   int    $0x21
 13e:   66 83 c4 10             add    $0x10,%esp
 142:   66 5e                   pop    %esi
 144:   66 5f                   pop    %edi
 146:   66 c3                   retl

Disassembly of section .rodata.str1.1:

00000148 <_heap-0x10>:
 148:   48                      dec    %ax
 149:   65 6c                   gs insb (%dx),%es:(%di)
 14b:   6c                      insb   (%dx),%es:(%di)
 14c:   6f                      outsw  %ds:(%si),(%dx)
 14d:   20 77 6f                and    %dh,0x6f(%bx)
 150:   72 6c                   jb     1be <_heap+0x66>
 152:   64 24 00                fs and $0x0,%al
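
The label `_heap-0x10` at address 0x148 in this listing is consistent with the linker script: the 13-byte string (12 characters plus the NUL terminator) ends at 0x155, and `_heap = ALIGN(4)` rounds that up to 0x158. A minimal sketch of that arithmetic follows. The helper names `align_up` and `com_file_offset` and the macro `COM_ORIGIN` are hypothetical; the 0x100 origin is the `. = 0x0100` from the linker script.

```c
#include <assert.h>
#include <stdint.h>

#define COM_ORIGIN 0x0100u  /* matches ". = 0x0100" in the linker script */

/* Round addr up to the next multiple of align (a power of two),
 * mirroring what ALIGN(4) does to the location counter. */
static uint32_t align_up(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Convert an address from the listing into an offset within the .COM file. */
static uint32_t com_file_offset(uint32_t addr)
{
    assert(addr >= COM_ORIGIN);
    return addr - COM_ORIGIN;
}
```

Here `align_up(0x148 + 13, 4)` evaluates to 0x158, and `com_file_offset(0x148)` evaluates to 0x48, the position of the string inside the file.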

An alternate version of the inline assembly passes subfunction 9 through a variable with the "a" constraint and marks it as an input/output operand with + (since the value returned in AX overwrites it). It could be written this way:

void print(char *str)
{
    unsigned short int write_fun = (0x09<<8) | 0x00;
    __asm__ __volatile__ (
        "int $0x21;"
        : "+a"(write_fun)
        : "d"(str)
        : "memory"
    );
}
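
The constant `(0x09<<8) | 0x00` works because AH is the high byte of AX and AL is the low byte. A minimal sketch of that packing, using hypothetical helper names (`pack_ax`, `ax_high`, `ax_low`) that are not part of any DOS or GCC API:

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uint16_t) == 2, "AX is a 16-bit register");

/* Hypothetical helper: build the AX value from its AH (high) and AL (low) halves. */
static uint16_t pack_ax(uint8_t ah, uint8_t al)
{
    return (uint16_t)(((uint16_t)ah << 8) | al);
}

/* Hypothetical helpers: split an AX value back into AH and AL. */
static uint8_t ax_high(uint16_t ax) { return (uint8_t)(ax >> 8); }
static uint8_t ax_low(uint16_t ax)  { return (uint8_t)(ax & 0xFF); }
```

`pack_ax(0x09, 0x00)` yields 0x0900, the same value assigned to `write_fun` above.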

Recommendation: Don't use GCC for 16-bit code generation. Inline assembly is difficult to get right, and you will probably need a fair amount of it for low-level routines. You could look at Smaller C, Bruce's C compiler, or Open Watcom C as alternatives. All of them can generate DOS COM programs.

Michael Petch
  • The `.comment` sections don't have the ALLOC flag set so don't get output in the executable. – Ross Ridge May 30 '19 at 04:56
  • @RossRidge : That recommendation actually has more to do with not cluttering the corresponding ELF file (if they chose to generate it) with unnecessary sections that may confuse them or lead them to believe they became part of the binary output file as well. – Michael Petch May 30 '19 at 04:59

The extra data is likely DWARF unwind information. You can stop GCC from generating it with the -fno-asynchronous-unwind-tables option.

You can also have the GNU linker discard the unwind information by adding the following to the SECTIONS command of your linker script:

/DISCARD/ : 
{
     *(.eh_frame)
}

Also note that the generated COM file will be one byte bigger than you expect because of the null byte at the end of the string.
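
The size difference is easy to confirm. A minimal sketch, assuming you are willing to spell out the bytes yourself: a string literal carries an implicit terminating NUL, while an explicit character array holds exactly the bytes you list. The names `with_nul` and `without_nul` are illustrative only.

```c
#include <assert.h>

/* String literal: 12 visible characters plus the implicit NUL terminator. */
static const char with_nul[] = "Hello world$";

/* Explicit initializer list: exactly 12 bytes, no NUL appended.
 * DOS function 09h stops at '$', so the NUL is not needed for printing. */
static const char without_nul[] = {
    'H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '$'
};

static_assert(sizeof(with_nul) == 13, "literal includes the NUL byte");
static_assert(sizeof(without_nul) == 12, "explicit array has no NUL byte");
```

The second form is not a valid C string, so it should only be passed to routines that stop at `$`, not to anything that expects NUL termination.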

Ross Ridge