For example, one filter (likely inspired by LZX, an earlier scheme with a similar feature) is called "Intel E8 preprocessing", which is designed to increase redundancy in x86 code.
If you imagine a program like this:
mov foo, bar
cmp bar, baz
push foo
call write
push bar
call write
The two calls are relative to the branch location, so there will be two different encodings of the same instruction. At compress time, we can translate them to absolute addresses. This way, the same instruction appears twice, and can therefore be compressed much more efficiently. When the archive is decompressed, this transformation can be reversed to restore the original.
WinRAR includes around a dozen standard filters that improve compression of several common inputs, but surprisingly also allows new filters to be defined at runtime by archives!
As far as I'm aware, no tool exists to explore this functionality. This is just too tempting to play with, so I've written some. So far, I have a RAR toolchain working, as well as some documentation on how to write programs.
You can checkout the code and some example programs on github, https://github.com/taviso/rarvmtools.
$ cat helloworld.rs
#include <constants.rh>
#include <util .rh>
; vim: syntax=fasm
; Test RAR assembly file.
_start:
; Install our message in the output buffer
mov r3, #0x1000 ; Output buffer.
mov [r3+#0], #0x6c6c6548 ; 'lleH'
mov [r3+#4], #0x57202c6f ; 'W ,o'
mov [r3+#8], #0x646c726f ; 'dlro'
mov [r3+#12], #0x00000a21 ; '!\n'
mov [VMADDR_NEWBLOCKPOS], r3 ; Pointer
mov [VMADDR_NEWBLOCKSIZE], #14 ; Size
call $_success
$ make helloworld.rar
cpp -I../stdlib < helloworld.rs > helloworld.ri
../raras -o helloworld.ro helloworld.ri
../rarld helloworld.ro > helloworld.rar
rm helloworld.ri helloworld.ro
$ unrar p -inul helloworld.rar
Hello, World!
At the moment, the toolchain consists of a minimal linker and assembler. I've been toying with the idea of porting a compiler (possibly an llvm backend, or gcc target), but I'm not sure if I'm that crazy.
Needless to say, programming something as ridiculous as WinRAR presents some interesting challenges, but the fact that it's so widely deployed presents some some fun possibilities. Here is a quick overview of the architecture.
Familiarity with x86 (and preferably Intel assembly syntax) would be an advantage, as the VM was clearly inspired by x86. There are 8 named registers, called r0 to r7.
r7 is used as a stack pointer for stack related operations (such as push, call, pop, etc), and grows down. A RarVM filter may execute for at most 250,000,000 instructions, at which
point it will be terminated. However there are no limits on the number of filters included in a file, so you can simply split your task into multiple 250M cycle chunks. If the instruction pointer ever moves outside the code segment, the program is considered to have completed
successfully.
Filters have a 0x40000 byte address space, and access to 3 status flags : ZF (Zero), CF (Carry) and SF (Sign) which can be accessed via the conditional branch instructions, or via pushf/popf (as on x86).
These are the operand types supported
- Registers (r0, r1, ..., r7)
- Memory references ([#0x12345])
- Register indirect references ([r0])
- Base + Index indirect references ([r4+#0x1234])
- Immediate (#0x12312)
As a convenience, the assembler also supports symbolic references that are resolved to immediates at compile time, just prefix them with a $.
- Symbolic references (jnz $next_loop)
There are no restrictions on how many memory references you can make (like on Intel), so this is perfectly legal:
mov [r0+#0x1234], [r7-#12]
RAR programs can either supply their own initial register values, or use the default set which includes some status information. If you don't specify your own registers, these will be set:
- r3: Address of Global Memory Buffer
- r4: Filter Block Length
- r5: Filter Execution Count
- r7: Top of address space (and thus the stack grows down).
Dynamic Output
So here's the fun part, because RAR is intended to be an archiver and not a programming environment, the Windows version of WinRAR requires the output from your program matches the CRC stored in the volume header....Ouch.That would normally mean that you can't write a program whose output you cannot predict at compile time, making things a little less interesting. However, there's no need to give up so easily.
Julien wrote a paper on dynamic CRC compensation algorithms (french, but the code is understandable). I've ported his algorithm to RarVM, so I now have a self-hosted RAR file that can produce dynamic output and compensate it's own output at runtime to match a fixed CRC.
Yes, that's pretty ridiculous.
Here is a portion of the CRC compensation code:
; Modifies a few bytes within the first 7 bytes of buffer to force the CRC32 to
; match the requested value. You should reserve the first few bytes of buffer
; for this process.
;
; Original implementation based on a paper by Julien Tinnes.
_compensate_crc:
; [r6+#16] target // Desired CRC value.
; [r6+#12] length // Size of buffer.
; [r6+#8] buffer // Pointer to buffer to compensate.
; [r6+#4] r
; [r6+#0] s
; [r6-#4] modlen // Length % 4
push r6 ; save frame pointer
mov r6, r7 ; create new frame
sub r7, #4 ; allocate variables
push [r6+#12] ; length
push [r6+#8] ; buffer
call $_crc_block ; _crc_block(buffer, length)
xor [r6+#16], r0 ; target ^= currentcrc
push #4 ; divisor
push [r6+#12] ; length
call $_mod ; _mod(length, 4)
mov [r6-#4], r0 ; save this calculation
sub [r6+#12], [r6-#4] ; length -= modlen
div [r6+#12], #4 ; divide by 4
sub [r6+#12], #1 ; calculate row for _find_vector_x
push [r6+#12] ; row
push [r6+#16] ; target
call $_find_vector_x ; _find_vector_x(target, row)
push r0 ; value
call $_bswap ; byte swap result
add [r6+#8], [r6-#4] ; calculate offset to modify
mov r1, [r6+#8] ; load address
xor [r1], r0 ; compensate crc
xor r0, r0 ; return code
mov r7, r6 ; clear frame
pop r6
ret
And the result:
$ cat sample.rs
#include
#include
#include
#include
; vim: syntax=fasm
; Test RAR assembly file that just demonstrates the syntax.
_start:
; Install our message in the output buffer
mov r3, #0x1000 ; Output buffer.
mov [r3+#0], #0x41414141 ; Padding for compensation
mov [r3+#4], #0x0a414141 ; Padding for compensation
mov [r3+#8], #0x6c6c6548 ; 'lleH'
mov [r3+#12], #0x57202c6f ; 'W ,o'
mov [r3+#16], #0x646c726f ; 'dlro'
mov [r3+#20], #0x00000a21 ; '!\n'
mov [VMADDR_NEWBLOCKPOS], #0x00001000
mov [VMADDR_NEWBLOCKSIZE], #22
; Compensate to required CRC
push RAR_FILECRC
push [VMADDR_NEWBLOCKSIZE]
push [VMADDR_NEWBLOCKPOS]
call $_compensate_crc
test r0, r0
jz $finished
call $_error
finished:
call $_success
$ make sample.rar
cpp -Istdlib < sample.rs > sample.ri
./raras -o sample.ro sample.ri
./rarld sample.ro > sample.rar
rm sample.ri sample.ro
$ unrar p -idq sample.rar
AAc��!A
Hello, World!
Considering the code is automatically executed during unrar...I wonder if it is possible to break out the sandbox? How constrained is the VM?
ReplyDeleteGood hacking.
ReplyDeleteWhat other calls are available unless 'write'?
ReplyDeleteI'm not entirely sure how or why, but after posting this to Facebook, when people visit the link, they get a security warning from facebook...
ReplyDeleteNot to tout my own horn too much, but I was a co-author on a paper that discusses achieving a given CRC in more abstract detail, but also gives a concrete C implementation for CRC-32 – in English, for those, like me, who don't speak French. :-) http://sar.informatik.hu-berlin.de/research/publications/SAR-PR-2006-05/SAR-PR-2006-05_.pdf
ReplyDeleteI can't figure out how you're generating that output I don't see any calls to a print function. Is output supposed to reside at a fixed location? Could you explain?
ReplyDeleteNever mind, I see the output buffer now. How can you do anything but write to stdout though, for instance, how would you output decompressed data?
ReplyDeleteI have been messing with this for a while, though you have surpassed me. Can you please write me? I cannot find a link to your email address anywhere. I just wanted to discuss your amazing find. Thanks. Random
ReplyDeleteYou can email me thru my site:
http://www.TheLegendOfRandom.com
Bravo, sir. Bravo! Madness like this is great fun, and you have earned mad kudos!
ReplyDeleteThis comment has been removed by the author.
ReplyDelete