This is the second of a three-part article on Software Piracy and contains practical tips for locating and identifying misappropriated software. This part will give a programmer, or someone familiar with software syntax, a running start if faced with the task of trying to identify copied software.
In the final analysis, you will have to look at two pieces of code side by side, but the following tools will help narrow down where to start looking. Once you have evidence of identical software, take a close look at the two sections of code.
WCopyfind – One of the primary tools I have found to be useful is the open-source software WCopyfind. WCopyfind is a Windows-based plagiarism-detection program that compares documents and reports similarities in words and phrases. While there are several tools on the market, I chose WCopyfind because it is a single executable program that can be installed and run on a local PC. As such, it can be used to securely analyze files that exist on a local hard drive. Many plagiarism detection tools are cloud-based and require either that data be uploaded to a third-party site for analysis or that a SaaS tool have access to files on a local hard drive. Because of the confidentiality issues involved in serving as an expert witness, I require a program that will not put the proprietary software of either company at risk.
WCopyfind is not designed to detect plagiarism in software, but it is beneficial when looking for similar phrases, variable names and especially comments. I find that the best procedure is to copy most of the code files into two large files – one containing the code from Company “A” and one containing the code from Company “B.” Then run WCopyfind on these two files and scroll through the output line by line. If there are many lines of code, this will require numerous passes with slightly different configurations.
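Copying the files together by hand gets tedious for a large code dump. The following is a minimal Python sketch of one way to build the combined file for each company. The function name `combine_sources`, the list of file extensions, and the `### FILE:` marker lines are my own assumptions for illustration; none of them is required by WCopyfind, and the marker lines can be dropped if they clutter the comparison.

```python
from pathlib import Path

# Hypothetical list of extensions; adjust to the languages in the case.
SOURCE_EXTENSIONS = {".c", ".h", ".cpp", ".cs", ".java", ".py"}


def combine_sources(root, output_path, extensions=SOURCE_EXTENSIONS):
    """Concatenate every source file under `root` into one text file.

    Returns the number of files that were combined.
    """
    root = Path(root)
    # Collect the file list first, in a stable order, before writing anything.
    files = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
    with Path(output_path).open("w", encoding="utf-8") as out:
        for path in files:
            # errors="replace" keeps going on files with unusual encodings.
            text = path.read_text(encoding="utf-8", errors="replace")
            # Marker line so a match can be traced back to its original file.
            out.write(f"### FILE: {path.relative_to(root)}\n")
            out.write(text)
            if not text.endswith("\n"):
                out.write("\n")
    return len(files)
```

Run it once per company, for example `combine_sources("company_a_src", "company_a.txt")`, and write the output somewhere outside the source tree so the combined file is never picked up on a later run.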
A Basic Text Editor – Not a word processing package, but an excellent simple editor. One of my favorites is EditPlus, but there are a bunch that will do. I like an editor that will take any file, not try to format the text, allow for sorting, and let me put multiple files side by side. Also a macro function can come in handy.
As before, copy the files from each company into a single large file. Use the editor’s sort function to sort the text, and remove all leading spaces and blank lines. You can write a relatively simple program to do this, but a text editor with macro capabilities may be all you need. Once the two files have been sorted and have had their leading spaces removed, you can begin to compare them manually, but I find that another run through the plagiarism software will now spot some new similarities. Things like the number of asterisks, dashes or slashes used to break up sections of code or comments are often overlooked and will be found by a program like WCopyfind. If you discover that both pieces of software create a line from the same number of dashes, then these sections of code warrant closer scrutiny.
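For those who would rather script this step than use editor macros, here is a rough Python sketch of the "relatively simple program" mentioned above. It strips leading whitespace, drops blank lines, sorts what remains, and also reports separator runs (asterisks, dashes, slashes and the like) that appear with the same length in both code bases. The function names and the threshold of five repeated characters are assumptions of mine, not settings taken from any tool.

```python
import re
from collections import Counter


def normalize_and_sort(text):
    """Strip leading whitespace, drop blank lines, and sort the rest."""
    stripped = (line.lstrip() for line in text.splitlines())
    return sorted(line for line in stripped if line.strip())


# Assumption: a "separator" is a run of 5 or more identical
# *, -, /, = or # characters.
SEPARATOR = re.compile(r"([*\-/=#])\1{4,}")


def separator_runs(text):
    """Count each (character, run length) pair found in the text."""
    counts = Counter()
    for match in SEPARATOR.finditer(text):
        counts[(match.group(1), len(match.group(0)))] += 1
    return counts


def shared_separators(text_a, text_b):
    """Return the (character, run length) pairs present in both texts."""
    return sorted(set(separator_runs(text_a)) & set(separator_runs(text_b)))
```

A shared pair such as `('-', 37)` is not proof of anything by itself, but an unusual run length that shows up in both code bases is a good place to start reading.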
Visual Studio is Microsoft’s tool for creating software projects in many different languages. I use this tool initially to help discover if the defendant in a case has supplied all of the files they have. Frequently you will receive a massive dump of routines with no way of making sense of them all. Since Visual Studio is a development environment, loading all files for a particular product helps identify routines that are referenced but not supplied. Visual Studio is also an excellent editing environment with powerful search and formatting abilities.
Code Compare allows two files to be easily compared side by side. It highlights the differences between the two files and is especially useful when you have two versions of the same file.
If you have a lot of versions of the same files, Mercurial (along with TortoiseHg, a visual interface for Mercurial) or a similar version control system (VCS) can come in very handy. “Check in” all of the versions of the files you have been given and then use the VCS’s reporting features and search tools to compare across versions quickly.
In fact, version control may hold the smoking gun. Examine the earliest versions of files, and look for differences between versions that indicate wholesale changes of variable names or the removal or addition of comments. Be especially suspicious when you discover that helpful comments were removed from one version to the next. Developers typically do not change comments, to the point that a legacy comment will often persist from version to version even after the underlying code has been changed enough that the comment is no longer helpful.
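A VCS diff will show these changes, but a short script can also flag vanished comments directly. The Python sketch below lists comments that appear in an older version of a file and nowhere in a newer one. It assumes C-style `//` and single-line `/* ... */` comments only, and it will be fooled by `//` inside string literals (a URL, for instance), so treat its output as a list of places to look rather than a finding. The function names are hypothetical.

```python
import re

# Assumption: C-style comments only, taken one line at a time.
# Group 1 is the text after //, group 2 the text after /* (up to */ if present).
COMMENT = re.compile(r"//(.*)|/\*(.*?)(?:\*/|$)")


def comments_in(source):
    """Return the comment text found on each line, in order."""
    found = []
    for line in source.splitlines():
        match = COMMENT.search(line)
        if not match:
            continue
        raw = match.group(1) if match.group(1) is not None else match.group(2)
        text = raw.strip()
        if text:
            found.append(text)
    return found


def removed_comments(old_source, new_source):
    """Comments present in the old version that appear nowhere in the new one."""
    remaining = set(comments_in(new_source))
    return [c for c in comments_in(old_source) if c not in remaining]
```

Comparing on the comment text rather than on whole lines means a comment that survives while the code beside it changes is not reported, which matches the pattern described above.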
You can also use the text editor on the large sorted files to delete the text on each line before the assignment operator (‘=’). Then re-sort the file and look for constants or equations that are similar or identical. Also look for constants (or #define statements) where the numeric values are identical. The name may have been changed completely, but if the value is the same, then the two constants may well be used in identical ways in the software. Find the constant or variable that matches the identical value and then run those down. I frequently make a table of constant values along with the names used in both samples of code.
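The table of constant values can also be built with a script. The sketch below, in Python, pulls numeric values out of `#define` lines and simple assignments in C-like code, groups the names by value, and reports the values that occur in both samples. The regular expressions and function names are my own assumptions and are deliberately simple: they ignore expressions, multi-line statements and anything that is not a plain numeric literal.

```python
import re
from collections import defaultdict

# Assumption: C-like syntax, one statement per line.
DEFINE = re.compile(r"^\s*#\s*define\s+(\w+)\s+(\S+)")
ASSIGN = re.compile(r"(\w+)\s*=\s*([^=;]+);")


def parse_number(token):
    """Return the numeric value of a literal, or None if it is not one."""
    token = token.strip().strip("()")
    if not re.match(r"[-+]?\.?\d", token):
        return None  # not a literal (e.g. an identifier or expression)
    trimmed = token.rstrip("uUlL")  # drop integer suffixes such as UL
    try:
        if re.fullmatch(r"0[0-7]+", trimmed):
            return int(trimmed, 8)  # C-style octal literal
        return int(trimmed, 0)  # decimal or 0x... hexadecimal
    except ValueError:
        pass
    try:
        return float(token.rstrip("fFlL"))  # drop float suffixes such as f
    except ValueError:
        return None


def constants_by_value(source):
    """Map each numeric value to the set of names assigned that value."""
    table = defaultdict(set)
    for line in source.splitlines():
        match = DEFINE.match(line)
        pairs = [match.groups()] if match else ASSIGN.findall(line)
        for name, raw in pairs:
            value = parse_number(raw)
            if value is not None:
                table[value].add(name)
    return table


def shared_constants(source_a, source_b):
    """Values found in both samples, with the names each sample uses."""
    a, b = constants_by_value(source_a), constants_by_value(source_b)
    return {v: (sorted(a[v]), sorted(b[v])) for v in a.keys() & b.keys()}
```

Because values are compared numerically, `0x10` in one sample lines up with `16` in the other. Common values such as 0 and 1 will match everywhere, so the unusual values are the ones worth running down.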
Automatic flowchart generators may be useful in some cases, but I have never found them to be helpful in looking for similar code. However, flowcharts can be an excellent visual aid when showing the similarities to non-technical people.
In the end, it will come down to looking at lines of code side by side. If you have a minimal amount of software to examine (a single file, for instance), a visual comparison may be all you need. For large samples of software, the tips contained in this article will let you know where to start looking.
For more on this topic, please see Part III – “What to Expect if You Become an Expert Witness in a Software Misappropriation Case”