Tuesday, 28 February 2017

How to build Grub 0.97 in Debian 8 / gcc (Debian 4.9.2-10) 4.9.2

Grub 0.97 is rather dated for recent Linux distributions. For example, on my Debian 8 box, ./configure stopped with:
configure: error: GRUB requires a working absolute objcopy; upgrade your binutils
Figure 1

The installed binutils is recent, so it must be Grub 0.97 that is out of sync. A quick search turns up the fix: objcopy needs one extra option. In the "configure" script, change this line
if { ac_try='${OBJCOPY-objcopy} -O binary conftest.exec conftest'
to
 if { ac_try='${OBJCOPY-objcopy} -R .note.gnu.build-id -O binary conftest.exec conftest'
Even after ./configure runs successfully, you still need to edit every Makefile in the Grub 0.97 directory and all its subdirectories, changing this line
OBJCOPY = @OBJCOPY@
to
OBJCOPY = @OBJCOPY@ --strip-unneeded -R .note -R .comment -R .note.gnu.build-id -R .reginfo -R .rel.dyn -R .note.gnu.gold-version
Otherwise the final images will be enormous, as in Figure 2, where stage1 is 134MB instead of the normal 512 bytes:

Figure 2
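
A likely explanation for the size, offered as an assumption rather than something verified on this box: "objcopy -O binary" writes one flat image that covers everything from the lowest to the highest load address, so a single stray section far away from the code is enough to inflate the file. The sketch below only does that arithmetic. The 0x7C00 address is the usual BIOS boot sector load address; the note address 0x08048000 (the common i386 default link base) and its 36-byte size are illustrative guesses, and flat_image_size is a hypothetical helper, not part of binutils.

```c
#include <stddef.h>

/* One loadable section: where it is loaded and how many bytes it holds. */
struct section {
    unsigned long addr;
    unsigned long size;
};

/*
 * Size of a flat binary image built from n sections: it spans from the
 * lowest section start to the highest section end, gaps included.
 */
unsigned long flat_image_size(const struct section *s, size_t n)
{
    unsigned long lo, hi;
    size_t i;

    if (n == 0)
        return 0;
    lo = s[0].addr;
    hi = s[0].addr + s[0].size;
    for (i = 1; i < n; i++) {
        if (s[i].addr < lo)
            lo = s[i].addr;
        if (s[i].addr + s[i].size > hi)
            hi = s[i].addr + s[i].size;
    }
    return hi - lo;
}
```

With only a 512-byte section at 0x7C00 the result is 512. Adding a 36-byte section at the assumed 0x08048000 gives 134480932 bytes, which is in the same range as the 134MB file, and removing the stray sections with -R brings the span back down.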

So far so good. The real problem started here. Once Grub was written to a floppy or ISO image, the virtual machine displayed "GRUB Loading stage2 ..." and died:

Figure 3

After several minor adjustments to the objcopy parameters, I decided to take a close look at the startup code in Grub. The last message before it died was "Loading stage2", so stage1 must be good. This string can be found in start.S, where it is displayed at the beginning of _start(), so I looked at the bottom of that function:
bootit:
        /* print a newline */
        MSG(notification_done)
        popw    %dx     /* this makes sure %dl is our "boot" drive */
#ifdef STAGE1_5
        ljmp    $0, $0x2200
#else /* ! STAGE1_5 */
        ljmp    $0, $0x8200
#endif /* ! STAGE1_5 */
...
notification_step:      .string "."
notification_done:      .string "\r\n"

Well, it looks like _start() worked pretty well: it loaded the second stage and made a long jump to its entry. So what went wrong? I went back to the make and link process and found this in the build log:
 gcc -Os -fno-stack-protector -fno-builtin -nostdinc  -DSUPPORT_SERIAL=1 -DSUPPORT_HERCULES=1 -DHAVE_CONFIG_H -I. -I. -I.. -I../stage1 -Wall -Wmissing-prototypes -Wunused -Wshadow -Wpointer-arith -falign-jumps=1 -falign-loops=1 -falign-functions=1 -Wundef -g -c -o start_exec-start.o `test -f 'start.S' || echo './'`start.S
gcc  -g   -o start.exec -nostdlib -Wl,-N -Wl,-Ttext -Wl,8000 start_exec-start.o
...
gcc  -g   -o pre_stage2.exec -nostdlib -Wl,-N -Wl,-Ttext -Wl,8200 pre_stage2_exec-asm.o pre_stage2_exec-bios.o pre_stage2_exec-boot.o pre_stage2_exec-builtins.o pre_stage2_exec-char_io.o pre_stage2_exec-cmdline.o pre_stage2_exec-common.o pre_stage2_exec-console.o pre_stage2_exec-disk_io.o pre_stage2_exec-fsys_ext2fs.o pre_stage2_exec-fsys_fat.o pre_stage2_exec-fsys_ffs.o pre_stage2_exec-fsys_iso9660.o pre_stage2_exec-fsys_jfs.o pre_stage2_exec-fsys_minix.o pre_stage2_exec-fsys_reiserfs.o pre_stage2_exec-fsys_ufs2.o pre_stage2_exec-fsys_vstafs.o pre_stage2_exec-fsys_xfs.o pre_stage2_exec-gunzip.o pre_stage2_exec-hercules.o pre_stage2_exec-md5.o pre_stage2_exec-serial.o pre_stage2_exec-smp-imps.o pre_stage2_exec-stage2.o pre_stage2_exec-terminfo.o pre_stage2_exec-tparm.o
objcopy --strip-unneeded -R .note -R .comment -R .note.gnu.build-id -R .reginfo -R .rel.dyn -R .note.gnu.gold-version -O binary pre_stage2.exec pre_stage2
...
objcopy --strip-unneeded -R .note -R .comment -R .note.gnu.build-id -R .reginfo -R .rel.dyn -R .note.gnu.gold-version -O binary start.exec start
...
cat start pre_stage2 > stage2

In short, the stage2 binary is the concatenation of two binaries, 'start' and 'pre_stage2'. 'start' is a 0x200-byte boot-up binary that is loaded at 0x8000. 'pre_stage2' comes right behind 'start', so it begins at 0x8200, as the -Ttext option on the link line says. Have a look at asm.S:
start:
_start:

ENTRY(main)
        /*
         *  Guarantee that "main" is loaded at 0x0:0x8200 in stage2 and
         *  at 0x0:0x2200 in stage1.5.
         */
        ljmp $0, $ABS(codestart)

        . = EXT_C(main) + 0x6
        .byte   COMPAT_VERSION_MAJOR, COMPAT_VERSION_MINOR

        /*
         *  This is a special data area 8 bytes from the beginning.
         */

        . = EXT_C(main) + 0x8

VARIABLE(install_partition)
        .long   0xFFFFFF
/* This variable is here only because of a historical reason.  */
VARIABLE(saved_entryno)
        .long   0
VARIABLE(stage2_id)
        .byte   STAGE2_ID
But in the problematic binary:

Figure 4


It doesn't look like a jump instruction, and the version bytes and the magic 0xffffff are missing too. Instead, I found the expected bytes at 0x83f8:

Figure 5
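
The check done by eye in the hex dumps can be sketched as a small C helper. This is only an illustration: looks_like_stage2_entry is a hypothetical name, and the byte values it tests (0xEA for the far jump, ff ff ff 00 for install_partition eight bytes in) are the ones that appear in the asm.S excerpt and the disassembly listings in this post. The version bytes at offsets 6 and 7 are deliberately not checked, since their values depend on the build.

```c
#include <stddef.h>

/*
 * Returns 1 if buf looks like the beginning of pre_stage2 as laid out
 * in asm.S, 0 otherwise.
 */
int looks_like_stage2_entry(const unsigned char *buf, size_t len)
{
    if (len < 12)
        return 0;

    /* 0xEA is the opcode of the far jump (ljmp) emitted for ENTRY(main). */
    if (buf[0] != 0xEA)
        return 0;

    /* install_partition sits 8 bytes in: .long 0xFFFFFF, little-endian. */
    return buf[8] == 0xFF && buf[9] == 0xFF &&
           buf[10] == 0xFF && buf[11] == 0x00;
}
```

For the pre_stage2 file the check applies to offset 0. In the concatenated stage2 file the same bytes are expected 0x200 further in, right after 'start'.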

The mystery was partly resolved by disassembling the 'pre_stage2.exec'. It looked like this:
00008200 <lba_to_chs.2277>:
8200:       55                      push   %ebp
8201:       89 e5                   mov    %esp,%ebp
8203:       57                      push   %edi
8204:       56                      push   %esi
8205:       53                      push   %ebx
8206:       53                      push   %ebx
...

0000826a <journal_init>:
826a:       55                      push   %ebp
826b:       ba 0c 00 00 00          mov    $0xc,%edx
8270:       89 e5                   mov    %esp,%ebp
8272:       57                      push   %edi
8273:       56                      push   %esi
8274:       53                      push   %ebx
8275:       8d 8d dc df ff ff       lea    -0x2024(%ebp),%ecx
827b:       81 ec 2c 20 00 00       sub    $0x202c,%esp
...

000083f8 <_start>:
83f8:       ea 70 82 00 00 00 03    ljmp   $0x300,$0x8270
83ff:       02 ff                   add    %bh,%bh
asm.o was supposed to be linked at the start address. For some reason, the two functions 'lba_to_chs' and 'journal_init' were placed ahead of '_start'. I reckon different versions or builds of the linker use different optimization strategies, and mine was simply too aggressive.

I googled as deep as I could and read the full ld man page, but I still could not find an option that turns off this optimization. It looks like it cannot be handled by simple command-line options, only by a custom linker script, which is the last thing I wanted to write. After two days of frustration, I decided to work around it in the source code by changing the scope of those functions. Only fsys_reiserfs.c and builtins.c are involved. In fsys_reiserfs.c, change
static int
journal_init (void)
to
int
journal_init (void)
and in builtins.c, relocate the whole implementation of
void lba_to_chs (int lba, int *cl, int *ch, int *dh)
out of partnew_func(). After a rebuild, the disassembly of pre_stage2.exec and reiserfs_stage1_5 is correct:
00008200 <_start>:
    8200:       ea 70 82 00 00 00 03    ljmp   $0x300,$0x8270
    8207:       02 ff                   add    %bh,%bh

00008208 <install_partition>:
    8208:       ff                      (bad)
    8209:       ff                      (bad)
    820a:       ff 00                   incl   (%eax)
Enjoy the Grub legacy
Figure 6
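
A note on the builtins.c change: moving a helper out of the function that encloses it means the helper can no longer see that function's local variables, so anything it used has to be passed in or live at file scope. The sketch below shows the shape of such a file-scope helper with a generic LBA-to-CHS conversion. It is not the Grub source: struct chs_geometry, its fields and the name lba_to_chs_sketch are all hypothetical.

```c
/* Disk geometry passed in explicitly instead of captured from an outer scope. */
struct chs_geometry {
    unsigned heads;    /* heads per cylinder */
    unsigned sectors;  /* sectors per track */
};

/* File-scope helper: convert a linear block address to cylinder/head/sector. */
void lba_to_chs_sketch(const struct chs_geometry *g, unsigned lba,
                       unsigned *cyl, unsigned *head, unsigned *sect)
{
    *cyl = lba / (g->sectors * g->heads);
    *head = (lba / g->sectors) % g->heads;
    *sect = lba % g->sectors + 1;   /* CHS sector numbers start at 1 */
}
```

Because the function is an ordinary external definition, it is emitted in source order along with the rest of its file, which is the property the workaround relies on.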
 


A patch for Grub 0.97 in Debian Jessie can be found here.


Install VMware Player 6.0.3 in CentOS 7

I recently found an old machine in a dusty corner, an Athlon 64 2600+: slow, but it supports hardware virtualization, so it is perfect for running virtual machines.

A nice thing is that AMD supports high-density DDR2, so I could buy cheap memory on eBay: $40 for 8GB, only half the price of normal memory.

I installed the newest CentOS 7, which was so new that many packages were hard to find in the repos. I finally managed to get most of them installed.

Downloaded VMware-Player-6.0.3-1895310.x86_64.bundle from the official VMware site.

The installation went fine, but the first run failed. I checked the log file, /tmp/vmware-root/vmware-modconfig-16392.log:

2014-07-25T21:29:05.141+10:00| vthread-3| I120: using /usr/bin/gcc for preprocess check
2014-07-25T21:29:05.153+10:00| vthread-3| I120: Preprocessed UTS_RELEASE, got value "3.10.0-123.el7.x86_64".
2014-07-25T21:29:05.153+10:00| vthread-3| I120: The header path "/lib/modules/3.10.0-123.el7.x86_64/build/include" for the kernel "3.10.0-123.el7.x86_64" is valid.  Whoohoo!
2014-07-25T21:29:05.181+10:00| vthread-3| I120: The GCC version matches the kernel GCC minor version like a glove.
2014-07-25T21:29:05.181+10:00| vthread-3| I120: Validating path "/lib/modules/3.10.0-123.el7.x86_64/build/include" for kernel release "3.10.0-123.el7.x86_64".
2014-07-25T21:29:05.181+10:00| vthread-3| I120: Failed to find /lib/modules/3.10.0-123.el7.x86_64/build/include/linux/version.h
2014-07-25T21:29:05.181+10:00| vthread-3| I120: /lib/modules/3.10.0-123.el7.x86_64/build/include/linux/version.h not found, looking for generated/uapi/linux/version.h instead.
2014-07-25T21:29:05.181+10:00| vthread-3| I120: using /usr/bin/gcc for preprocess check
2014-07-25T21:29:05.197+10:00| vthread-3| I120: Preprocessed UTS_RELEASE, got value "3.10.0-123.el7.x86_64".
2014-07-25T21:29:05.197+10:00| vthread-3| I120: The header path "/lib/modules/3.10.0-123.el7.x86_64/build/include" for the kernel "3.10.0-123.el7.x86_64" is valid.  Whoohoo!
2014-07-25T21:29:05.197+10:00| vthread-3| I120: Using temp dir "/tmp".
2014-07-25T21:29:05.199+10:00| vthread-3| I120: Obtaining info using the running kernel.
2014-07-25T21:29:05.199+10:00| vthread-3| I120: Setting header path for 3.10.0-123.el7.x86_64 to "/lib/modules/3.10.0-123.el7.x86_64/build/include".
2014-07-25T21:29:05.199+10:00| vthread-3| I120: Validating path "/lib/modules/3.10.0-123.el7.x86_64/build/include" for kernel release "3.10.0-123.el7.x86_64".
2014-07-25T21:29:05.199+10:00| vthread-3| I120: Failed to find /lib/modules/3.10.0-123.el7.x86_64/build/include/linux/version.h
2014-07-25T21:29:05.199+10:00| vthread-3| I120: /lib/modules/3.10.0-123.el7.x86_64/build/include/linux/version.h not found, looking for generated/uapi/linux/version.h instead.
2014-07-25T21:29:05.199+10:00| vthread-3| I120: using /usr/bin/gcc for preprocess check
2014-07-25T21:29:05.211+10:00| vthread-3| I120: Preprocessed UTS_RELEASE, got value "3.10.0-123.el7.x86_64".
2014-07-25T21:29:05.211+10:00| vthread-3| I120: The header path "/lib/modules/3.10.0-123.el7.x86_64/build/include" for the kernel "3.10.0-123.el7.x86_64" is valid.  Whoohoo!
2014-07-25T21:29:05.435+10:00| vthread-3| I120: Invoking modinfo on "vmnet".
2014-07-25T21:29:05.440+10:00| vthread-3| I120: "/sbin/modinfo" exited with status 256.
2014-07-25T21:29:05.615+10:00| vthread-3| I120: Setting destination path for vmnet to "/lib/modules/3.10.0-123.el7.x86_64/misc/vmnet.ko".
2014-07-25T21:29:05.615+10:00| vthread-3| I120: Extracting the vmnet source from "/usr/lib/vmware/modules/source/vmnet.tar".
2014-07-25T21:29:05.681+10:00| vthread-3| I120: Successfully extracted the vmnet source.
2014-07-25T21:29:05.682+10:00| vthread-3| I120: Building module with command "/usr/bin/make -j2 -C /tmp/modconfig-j0FvbF/vmnet-only auto-build HEADER_DIR=/lib/modules/3.10.0-123.el7.x86_64/build/include CC=/usr/bin/gcc IS_GCC_3=no"
2014-07-25T21:29:12.543+10:00| vthread-3| W110: Failed to build vmnet.  Failed to execute the build command.
At first the missing linux/version.h caught my attention, but it turned out not to be a big deal. So I moved on to /usr/lib/vmware/modules/source/vmnet.tar:

root:/usr/lib/vmware/modules/source# ls
vmblock.tar  vmci.tar  vmmon.tar  vmnet.tar  vsock.tar
root:/usr/lib/vmware/modules/source# tar xf vmnet.tar
root:/usr/lib/vmware/modules/source# ls
vmblock.tar  vmci.tar  vmmon.tar  vmnet-only  vmnet.tar  vsock.tar
root:/usr/lib/vmware/modules/source# mv vmnet.tar vmnet.tar.buggy
root:/usr/lib/vmware/modules/source# cd vmnet-only
root:/usr/lib/vmware/modules/source/vmnet-only# ls
bridge.c            driver.c         monitorAction_exported.h  smac_compat.c          vm_basic_defs.h      vnetEvent.h
community_source.h  driver-config.h  netdev_has_dev_net.c      smac_compat.h          vm_basic_types.h     vnetFilter.h
compat_autoconf.h   filter.c         netdev_has_net.c          smac.h                 vm_device_version.h  vnetFilterInt.h
compat_module.h     geninclude.c     net.h                     userif.c               vmnetInt.h           vnet.h
compat_netdevice.h  hub.c            netif.c                   vm_assert.h            vm_oui.h             vnetInt.h
compat_skbuff.h     includeCheck.h   nfhook_uses_skb.c         vm_atomic.h            vmware_pack_begin.h  vnetKernel.h
compat_sock.h       Makefile         procfs.c                  vm_basic_asm.h         vmware_pack_end.h    vnetUserListener.c
compat_version.h    Makefile.kernel  skblin.c                  vm_basic_asm_x86_64.h  vmware_pack_init.h   x86cpuid.h
COPYING             Makefile.normal  smac.c                    vm_basic_asm_x86.h     vnetEvent.c

Not bad, the source code is there. Let's try to compile it:

root:/usr/lib/vmware/modules/source/vmnet-only# make
Using 2.6.x kernel build system.
make -C /lib/modules/3.10.0-123.el7.x86_64/build/include/.. SUBDIRS=$PWD SRCROOT=$PWD/. \
  MODULEBUILDDIR= modules
make[1]: Entering directory `/usr/src/kernels/3.10.0-123.el7.x86_64'
  CC [M]  /usr/lib/vmware/modules/source/vmnet-only/driver.o
  CC [M]  /usr/lib/vmware/modules/source/vmnet-only/hub.o
  CC [M]  /usr/lib/vmware/modules/source/vmnet-only/userif.o
  CC [M]  /usr/lib/vmware/modules/source/vmnet-only/netif.o
  CC [M]  /usr/lib/vmware/modules/source/vmnet-only/bridge.o
  CC [M]  /usr/lib/vmware/modules/source/vmnet-only/filter.o
/usr/lib/vmware/modules/source/vmnet-only/filter.c:209:1: error: conflicting types for ‘VNetFilterHookFn’
 VNetFilterHookFn(unsigned int hooknum,                 // IN:
 ^
/usr/lib/vmware/modules/source/vmnet-only/filter.c:64:18: note: previous declaration of ‘VNetFilterHookFn’ was here
 static nf_hookfn VNetFilterHookFn;
                  ^
/usr/lib/vmware/modules/source/vmnet-only/filter.c:64:18: warning: ‘VNetFilterHookFn’ used but never defined [enabled by default]
/usr/lib/vmware/modules/source/vmnet-only/filter.c:209:1: warning: ‘VNetFilterHookFn’ defined but not used [-Wunused-function]
 VNetFilterHookFn(unsigned int hooknum,                 // IN:
 ^
make[2]: *** [/usr/lib/vmware/modules/source/vmnet-only/filter.o] Error 1
make[1]: *** [_module_/usr/lib/vmware/modules/source/vmnet-only] Error 2
make[1]: Leaving directory `/usr/src/kernels/3.10.0-123.el7.x86_64'
make: *** [vmnet.ko] Error 2
root:/usr/lib/vmware/modules/source/vmnet-only#

I googled the VNetFilterHookFn function. Reportedly, since 3.13.5 the netfilter hook function type (nf_hookfn) has changed as the result of a code refactoring, but VNetFilterHookFn still uses the old definition of nf_hookfn. So the fix is to change the definition of VNetFilterHookFn to match the current kernel. I opened filter.c in vi, jumped to line 209 (209gg), and found this:


#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 13, 0)
VNetFilterHookFn(const struct nf_hook_ops *ops,        // IN:
#else
VNetFilterHookFn(unsigned int hooknum,                 // IN:
#endif

What the hell, the patch is already there. Hold on, what kernel version is CentOS 7 running?

andy:~$ uname -a
Linux localhost.localdomain 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

So that's the reason. The kernel version is lower than 3.13.0, so filter.c uses the old definition. However, RHEL 7 must have backported some recent patches, so its netfilter hook function already has the new signature. Let's see how many KERNEL_VERSION checks there are:

root:/usr/lib/vmware/modules/source/vmnet-only# grep KERNEL_VERSION *.c
bridge.c:#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 22)
bridge.c:#if defined(NETIF_F_GSO) || LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 18)
driver.c:#if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 4, 8)
filter.c:#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 13, 0)
filter.c:#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 13, 0)
netdev_has_dev_net.c:#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 25)
netdev_has_dev_net.c:#elif LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 26)
netdev_has_net.c:#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 23)
netdev_has_net.c:#elif LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24)
nfhook_uses_skb.c:#if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 23)
nfhook_uses_skb.c:#elif LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24)
procfs.c:#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 10, 0)
procfs.c:#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 10, 0)
procfs.c:#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 10, 0)
skblin.c:#if LINUX_VERSION_CODE <= KERNEL_VERSION(2, 6, 17)

Only the two in filter.c require 3.13.0. OK, I changed them to

#if LINUX_VERSION_CODE >= KERNEL_VERSION(3, 10, 0)

then

root:/usr/lib/vmware/modules/source/vmnet-only# make
Using 2.6.x kernel build system.
make -C /lib/modules/3.10.0-123.el7.x86_64/build/include/.. SUBDIRS=$PWD SRCROOT=$PWD/. \
  MODULEBUILDDIR= modules
make[1]: Entering directory `/usr/src/kernels/3.10.0-123.el7.x86_64'
  CC [M]  /usr/lib/vmware/modules/source/vmnet-only/filter.o
  CC [M]  /usr/lib/vmware/modules/source/vmnet-only/procfs.o
  CC [M]  /usr/lib/vmware/modules/source/vmnet-only/smac_compat.o
  CC [M]  /usr/lib/vmware/modules/source/vmnet-only/smac.o
  CC [M]  /usr/lib/vmware/modules/source/vmnet-only/vnetEvent.o
  CC [M]  /usr/lib/vmware/modules/source/vmnet-only/vnetUserListener.o
  LD [M]  /usr/lib/vmware/modules/source/vmnet-only/vmnet.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /usr/lib/vmware/modules/source/vmnet-only/vmnet.mod.o
  LD [M]  /usr/lib/vmware/modules/source/vmnet-only/vmnet.ko
make[1]: Leaving directory `/usr/src/kernels/3.10.0-123.el7.x86_64'
make -C $PWD SRCROOT=$PWD/. \
  MODULEBUILDDIR= postbuild
make[1]: Entering directory `/usr/lib/vmware/modules/source/vmnet-only'
make[1]: `postbuild' is up to date.
make[1]: Leaving directory `/usr/lib/vmware/modules/source/vmnet-only'
cp -f vmnet.ko ./../vmnet.o

:-D
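
Why the original guard picked the old signature can be checked with a bit of arithmetic. The sketch below mirrors the way KERNEL_VERSION(a, b, c) packs a version into one integer in kernels of this era, (a << 16) + (b << 8) + c. The function names are hypothetical and this is userspace code for illustration, not something from the vmnet sources.

```c
/* Same packing as KERNEL_VERSION(a, b, c) for the kernels discussed here. */
unsigned long kernel_version_code(unsigned a, unsigned b, unsigned c)
{
    return ((unsigned long)a << 16) + ((unsigned long)b << 8) + c;
}

/* Mirrors "#if LINUX_VERSION_CODE >= KERNEL_VERSION(...)" as a runtime test. */
int guard_selects_new_signature(unsigned long running, unsigned long threshold)
{
    return running >= threshold;
}
```

For the 3.10.0 kernel of CentOS 7 the code is 199168, while 3.13.0 is 199936, so the 3.13.0 guard is false and the old prototype is compiled. Lowering the threshold to 3.10.0 makes the guard true. Note that this also switches stock 3.10 to 3.12 kernels to the new prototype, so the edited file only suits kernels that carry the backport.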

Now repack vmnet.tar. Since the VMware runtime will build the package anyway, there is no need to clean the object files; leaving them in might even speed up the install.

root:/usr/lib/vmware/modules/source# tar cf vmnet.tar vmnet-only
root:/usr/lib/vmware/modules/source# ls
vmblock.tar  vmci.tar  vmmon.tar  vmnet.o  vmnet-only  vmnet.tar  vmnet.tar.buggy  vsock.tar

Then run VMware Player and cross fingers ...... oh yeah, it succeeded! I ran some VMs and all was good.

Socket programming: when clients shut down

I had not touched socket programming for years. Recently I wanted to add logging functions to libcsoup. The idea is to transfer log information to a remote client, if there is one. So the model is quite simple, I thought: open a socket, listen on it, send to the remote machine, and close the socket if sending fails.

The first problem is that I have to start a thread to listen on the socket. The library has no main thread of its own, so it cannot use select or poll. The signal model is even worse: the object structure would have to be made global to be visible in the event handler.

The real trouble is when I use send() like this:

if (send(sockfd, buf, len, 0) < 0) {
    switch (errno) {
    case ENOTSOCK:
        /* ... */
        break;
    case EPIPE:
        /* ... */
        break;
    /* ... */
    }
}

The whole program exits instead of catching the error condition when the peer closes. It felt like deja vu, so I did a quick search. The answer is SIGPIPE. When writing to a socket whose peer has closed, the system sends SIGPIPE to the program. My sandbox does not handle the signal, so the default action is taken, which terminates the whole program.

I don't want to catch the signal because the code is part of a library, and rerouting the signal handling may cause trouble for other parts of the program. So I read the man page of send() again and found that its last argument, flags, accepts MSG_NOSIGNAL to suppress the signal:

MSG_NOSIGNAL (since Linux 2.2)
              Requests  not to send SIGPIPE on errors on stream oriented sockets when the other end breaks the
              connection.  The EPIPE error is still returned.

So change the code to:

if (send(sockfd, buf, len, MSG_NOSIGNAL) < 0) {
    switch (errno) {
    case ENOTSOCK:
        /* ... */
        break;
    case EPIPE:
        /* ... */
        break;
    /* ... */
    }
}

It works, hooray!
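
The behaviour is easy to reproduce in isolation. The sketch below is a minimal Linux-only test, assuming an AF_UNIX socketpair() as a stand-in for the real network connection; send_to_closed_peer is a hypothetical helper and not part of libcsoup. It closes one end, sends on the other with MSG_NOSIGNAL, and reports the errno it gets back.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Sends one byte to a peer that has already closed its end.
 * Returns the errno reported by send(), 0 if send() unexpectedly
 * succeeded, or -1 if the socket pair could not be created.
 */
int send_to_closed_peer(void)
{
    int sv[2];
    ssize_t n;
    int err;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    close(sv[1]);                       /* the "client" goes away */

    /* Without MSG_NOSIGNAL this call would raise SIGPIPE instead. */
    n = send(sv[0], "x", 1, MSG_NOSIGNAL);
    err = (n < 0) ? errno : 0;

    close(sv[0]);
    return err;
}
```

Called from a test program, the function returns EPIPE and the process keeps running, which is the case the switch statement handles.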