Most work on comparing programs for similarity to detect software piracy either assumes the availability of the program source code (e.g., Moss) or performs a complicated source program transformation to embed carefully designed signatures, or software watermarks, into the binary code. In this paper, we propose a new approach to detecting program similarities that requires neither the availability of the program source nor complicated compile-time watermarking techniques. Furthermore, in contrast to the alternatives, our framework is resistant to standard attacks such as code obfuscation. Our approach exploits the observation that the sequence of system calls issued during a program's execution provides a strong signature of the program's semantics or functionality, so the inherent properties of a program are used to identify it. By statistically analyzing sequences of system calls, the relative similarities and differences of program regions can be determined automatically. We have developed a framework that automatically extracts system call sequences, computes the similarities between two binaries via statistical analysis, and maps dynamically similar regions onto textually similar source files. We present several case studies showing the applicability of our framework in pinpointing pirated segments. Our experimental study also shows that directly comparing the binary files of the programs without considering their dynamic behavior is ineffective, and demonstrates strong consistency between the output of our new framework and that of Moss.
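To make the idea of comparing system call sequences concrete, the following is a minimal sketch rather than the framework's actual method. The abstract does not specify which statistical analysis is used, so this sketch assumes one simple choice: count n-grams of consecutive system calls in each trace and compare the counts by cosine similarity. The function names, the window size `n=3`, and the example traces are all hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(calls, n=3):
    """Count each window of n consecutive system calls in a trace."""
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def trace_similarity(trace_a, trace_b, n=3):
    """Similarity score in [0, 1] for two system call traces (assumed metric)."""
    return cosine_similarity(ngram_counts(trace_a, n), ngram_counts(trace_b, n))


# Hypothetical traces, e.g. as might be recorded by a system call tracer.
original = ["open", "read", "read", "write", "close",
            "open", "read", "write", "close"]
suspect = list(original)  # same dynamic behavior, whatever the code looks like
unrelated = ["socket", "connect", "send", "recv", "close",
             "socket", "connect", "send"]

print(trace_similarity(original, suspect))
print(trace_similarity(original, unrelated))
```

Because the score depends only on the calls a program issues at run time, renaming identifiers or reordering code in the binary would not by itself change it, which is the intuition behind the claimed resistance to obfuscation. Mapping similar regions back to source files would require applying such a measure to per-region sub-traces, which this sketch does not attempt.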

Date of this Version

January 2008