
The malware samples provided by Microsoft in the Kaggle challenge (https://www.kaggle.com/c/malware-classification/data) contain a hexadecimal representation of the code segment. An example:

    00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
    00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
    00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
    00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
    00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80
    00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90
    00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19
    00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00
    00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00
    00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00
    ...

I want to convert them back to binary format so that I can then convert them to images (and also save space).

I tried xxd -r -p, but the output is incorrect: xxd decodes the address column (e.g. 00401000) as data too, whereas I want to drop the addresses.

Is there a quick way to do this?

cchamberlain
user1734905

1 Answer


First you need to strip the addresses in the first column, since they're not part of the code itself; they work like line numbers for the hex dump. I'd use awk for that, then run xxd -r -p again on what remains.
Awk syntax stolen from: Using awk to print all columns from the nth to the last
Try something like this (I don't have xxd handy, so I couldn't test it):

    awk '{$1=""; print $0}' yourhexfile | xxd -r -p > aFileContainingActualCode
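
If xxd isn't available, the same idea (drop the first column, decode the rest) can be sketched in Python. This is only an illustration, not a tested replacement for the pipeline above: the names hex_dump_to_bytes and convert_file are made up for this example, and it assumes every token after the address is a valid two-digit hex value (anything else raises ValueError).

```python
def hex_dump_to_bytes(lines):
    """Decode 'ADDRESS B0 B1 ...' lines into raw bytes, dropping the address.

    Assumption: the first whitespace-separated token on each line is the
    address, and every remaining token is a two-digit hex byte.
    """
    out = bytearray()
    for line in lines:
        tokens = line.split()
        # tokens[0] is the address column; skip it and decode the rest.
        # Blank lines (no tokens) contribute nothing.
        for token in tokens[1:]:
            out.append(int(token, 16))
    return bytes(out)


def convert_file(src_path, dst_path):
    """Read a hex dump from src_path and write the decoded bytes to dst_path."""
    with open(src_path, "r") as src, open(dst_path, "wb") as dst:
        dst.write(hex_dump_to_bytes(src))
```

Calling convert_file("yourhexfile", "aFileContainingActualCode") would play the role of the command above; both file names are the placeholders used there.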
LinuxDisciple
  • Should the binary file's size be 8*L (where L is the number of lines in the hex file)? I tried the conversion, but the final binary size is always 2-3 times larger than 8*L. – user1734905 Mar 30 '16 at 05:03
  • Size should be 16*L ... each row has 16 bytes. – sessyargc.jp Mar 30 '16 at 05:11
  • The binary's size should be a bit less than 1/3 of the hex file's size. 57 bytes of the hex file (8+16*3+CR) represent 16 bytes of binary, so the ratio hexfile / binfile should be about 350%. Don't confuse the line count with the file size. – Tommylee2k Mar 30 '16 at 07:58
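
The size arithmetic in the comments above can be checked with a few lines of Python. This is a rough sketch under the comments' own assumptions (an 8-character address, 16 bytes per line, one line-terminator character); expected_sizes is a name invented for this example.

```python
ADDRESS_CHARS = 8    # e.g. "00401000"
BYTES_PER_LINE = 16  # each row of the dump holds 16 bytes


def expected_sizes(num_lines, newline_len=1):
    """Return (hex_file_size, binary_size) for a dump with num_lines full rows."""
    # Each byte costs 3 characters in the dump: a space plus two hex digits.
    hex_line_len = ADDRESS_CHARS + 3 * BYTES_PER_LINE + newline_len
    return num_lines * hex_line_len, num_lines * BYTES_PER_LINE


hex_size, bin_size = expected_sizes(10)
ratio = hex_size / bin_size  # 57 / 16 with a one-character line ending
```

A converted file much larger than 16*L would suggest that the address column was decoded as data as well.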