
Bachelor Thesis

Abdul Sabor Bostan

Winnowing Algorithm for Program Code

16 July 2017

supervised by: Prof. Dr. Schupp

Hamburg University of Technology (TUHH)
Technische Universität Hamburg-Harburg
Institute for Software Systems
21073 Hamburg


Statutory Declaration

I declare in lieu of an oath that I have written this thesis independently and have used no sources or aids other than those indicated. This thesis has not previously been submitted in this or a similar form to any examination board.

Hamburg, 16 July 2017
Abdul Bostan


Contents

1. Introduction

2. Background
2.1. Typical Plagiarism Attacks
2.2. Desirable Properties
2.3. Rabin-Karp Algorithm
2.4. All-To-All Matching
2.5. Java Bytecode

3. Winnowing Algorithm
3.1. Parameter
3.2. Example
3.3. Efficient Implementation of Winnowing
3.4. Expected Density

4. Implementation
4.1. Requirements and Restrictions
4.2. Structure
4.3. Workflow
    4.3.1. Input
    4.3.2. Tokenization
    4.3.3. K-Gram and Hashing
    4.3.4. Build Windows and Select Fingerprints
    4.3.5. Compare Fingerprints
    4.3.6. Output
4.4. Implementation Details

5. Experiments
5.1. Goals
5.2. Test Set
5.3. Test Run
    5.3.1. First Test Run
    5.3.2. Second Test Run
    5.3.3. Third Test Run
5.4. Evaluation

6. Conclusion and Future Work

A. Functions

B. Results
B.1. Bytecode
B.2. Source Code

List of Figures

2.1. K-Gram example
2.2. All-To-All matching example
2.3. Structure of Java class file
2.4. Java Hello World example
2.5. Java bytecode Hello World example

3.1. Winnowing algorithm example
3.2. Code for winnowing main loop

4.1. Structure of the copy-detection tool
4.2. Workflow of the implementation
4.3. Source code tokenization
4.4. Bytecode tokenization
4.5. Class KGram
4.6. Code snippet for building windows and selecting fingerprints
4.7. Fingerprint comparison
4.8. Found matches example

A.1. Functions for getting minimum value and corresponding index in an array

1. Introduction

Plagiarism is not new and has always been a problem in society, particularly in academia, but digitalization and the widespread use of computers have made copying documents very easy. Nowadays, all kinds of documents are available as digital content and are freely accessible, either legally or illegally. This development makes it easy to copy digital content found on the web without much effort. As a result, plagiarism can be found in any field, such as literature, design, scientific papers, and source code.

This thesis focuses on the detection of a particular type of plagiarism: source code plagiarism, also known as software plagiarism. Source code plagiarism is often committed by computer science students, who solve their assignments by copying from other students or from the internet and presenting the result as their own work.

Universities have therefore become more concerned about this development and have worked on tools to detect plagiarism. Through this effort, copy-detection tools were invented and are now used in academia. These tools search for plagiarism either by comparing a set of documents and determining whether similarities exist between them, or by searching the internet or another database for matches.

For example, the copy-detection tool MOSS [6], which is based on the winnowing algorithm [6], detects plagiarism in documents and source code. Another well-known tool is JPlag [1], which was developed at the Karlsruhe Institute of Technology to find software similarities by comparing multiple sets of source code. JPlag detects disallowed copying among student exercises without comparing them to the internet.

The use of others' source code without crediting the original author is a problem not only in universities but also in industry. Companies could suffer large economic losses if third parties were able to obtain the source code of their products and then profit by reselling those products with modified user interfaces. In this case, however, a copy-detection tool like JPlag or MOSS is usually not helpful, because companies normally keep their source code confidential.

Therefore, a new solution is needed to detect plagiarism when the source code of a suspicious program is not available. One approach is to compare the binary code of programs: source code is not needed, and similarities between binary codes are detected instead. Accordingly, the two goals of this thesis are to develop a plagiarism checker for source code and one for binary code. The focus lies on the Java language, so the developed tool should be able to find similarities within a set of Java source files and within a set of Java class files, which contain the bytecode.

The tool is based on the fingerprinting algorithm winnowing [6], which maps an arbitrarily large file to a set of hashes. These hashes serve as fingerprints of the original file, just as people are identified by their fingerprints.


After the development of the tool, the following research questions are analyzed:

1. Does the copy-detection tool detect all plagiarism cases in source code and bytecode?
   The answer to this question should show which kinds of similarities in source code and bytecode are detected by the winnowing algorithm and which plagiarism attacks have a high probability of not being found.

2. Is there a relation between the matches found in source code and in bytecode?
   This research question examines whether, for each match detected in the bytecode, a corresponding match can be found in the source code, and vice versa.

The goal is a copy-detection tool that is able to detect all similarities in Java source code and bytecode. Additionally, finding matches in one type of code should imply that the same matches are also detected in the other type of code. Consequently, if the source code is not available, plagiarism should also be detectable by comparing only binary code.

An important note is that tools like JPlag, MOSS, or the implementation of this thesis only show similarities. Found similarities never imply by themselves that plagiarism has occurred. The decision to declare found matches as plagiarism must be made by the user after analyzing them.

The thesis is structured into six chapters. First, the prerequisites for understanding copy-detection algorithms are explained in Chapter 2, which includes a list of frequently used plagiarism attacks, desirable properties of a plagiarism checker, two examples of fingerprinting algorithms, and a brief introduction to Java bytecode.

Chapter 3 is dedicated to the winnowing algorithm, on which our implementation is based. The sequence of steps of winnowing is shown using an example, and then an efficient implementation of the algorithm is presented. After that, the term expected density is introduced, which is important for the evaluation of the experiments.

In Chapter 4, the implementation of the copy-detection tool is presented. The requirements and restrictions are explained and the structure of the tool is given. Then, in Section 4.3, the workflow is explained step by step for both types of code. Chapter 5 details the results of several test runs. First, the goals, which are derived from the research questions, are defined, and then the sample data is described. After that, the results of the three test runs are shown and evaluated. Finally, Chapter 6 contains the conclusion of this thesis and possible further improvements.


2. Background

This chapter describes the prerequisites for understanding copy detection in source code and binary code. It introduces typical plagiarism attacks and the desirable properties that a copy-detection algorithm needs to satisfy in order to be robust against them. Then, the Rabin-Karp algorithm and all-to-all matching are described and analyzed to show their weaknesses. Finally, a brief introduction to Java bytecode is provided, because knowing the basics of bytecode is necessary to understand the implementation in Chapter 4.

2.1. Typical Plagiarism Attacks

Program code plagiarism is the act of using or modifying someone else's code without authorization and presenting it as one's own work without crediting the original author. Typical plagiarism attacks are the following [2]:

1. Copying word for word

2. Changing comments and whitespace

3. Renaming variable identifiers

4. Reordering the positions of code blocks and statements

5. Adding redundant or dummy statements and variables

6. Changing data types

7. Replacing control structures with equivalent structures

The first three methods are very simple plagiarism attacks and do not require much time. The attacker also does not need to understand the code, because these attacks do not change any functionality.

Methods 4–5 are typical attacks used by culprits who understand the basics of the implementation language. Methods 1–5 are low-level attacks, which can be detected by most copy-detection algorithms. Methods 6–7 are higher-level plagiarism attacks committed by an attacker who understands the structure of the source code. These attacks are not easily detected by copy-detection algorithms.

2.2. Desirable Properties

A plagiarism detector should be robust against the attacks of Methods 1–5. Therefore, we define three desirable properties that a copy-detection algorithm should have [6]:


• Insensitivity: Matches in program code need to be unaffected by whitespace, capitalization, comments, and punctuation. Furthermore, matches should be insensitive to the renaming of variable identifiers.

• Noise suppression: The discovered matches must be large enough to imply that the code block has been copied. Hence, short matches, such as a try-catch block in two different Java files, are of no interest to a copy-detection algorithm.

• Position independence: The set of matches found needs to be unaffected by adding parts to the program code or removing parts of it. Shuffling the order of blocks in the program code should also not change the set of discovered matches.

The first property is handled in the same way by most copy-detection algorithms: the input runs through a tokenization step that removes all undesirable differences between program codes, e.g. comments, whitespace, and different variable names.

Noise suppression can be handled by using k-grams, which are used by fingerprint-based copy-detection algorithms. A k-gram is defined as follows [6]:

Definition 1 (K-GRAM). A k-gram is a contiguous substring of length k of a given string, where the parameter k is chosen by the user.

ThisIsAKGram
(a) String

ThisI hisIs isIsA sIsAK IsAKG sAKGr AKGra KGram
(b) 5-grams derived from the string

Figure 2.1.: K-Gram example

Figure 2.1 shows how k-grams are derived from a string. For a string of length l, the number of k-grams is l − k + 1. The string in Figure 2.1 has length 12, so it yields 12 − 5 + 1 = 8 5-grams.
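The derivation of k-grams can be sketched in a few lines of Python. This is only an illustration of Definition 1 and not part of the thesis tool; the function name kgrams is hypothetical.

```python
from typing import List


def kgrams(text: str, k: int) -> List[str]:
    """Return all contiguous substrings of length k, ordered by position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A string of length l yields l - k + 1 k-grams (none if l < k).
    return [text[i:i + k] for i in range(len(text) - k + 1)]


grams = kgrams("ThisIsAKGram", 5)
print(grams)       # the eight 5-grams of Figure 2.1(b)
print(len(grams))  # 12 - 5 + 1 = 8
```

A string that is shorter than k yields no k-grams at all, which is the effect that noise suppression relies on.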

K-grams satisfy the noise suppression property because there exists a threshold k such that all interesting matches are at least k long, whereas matches shorter than k are usually not of interest: they tend to be common idioms, keywords, or identifiers of the chosen implementation language.

The third property is the most interesting requirement. It is difficult to satisfy because removing or adding parts usually affects the set of matches. A copy-detection algorithm therefore needs to minimize the influence of removing, adding, and shuffling parts. For fingerprint-based algorithms, this can also be achieved with k-grams.


2.3. Rabin-Karp Algorithm

The Rabin-Karp algorithm for fast substring matching [3], created by Michael O. Rabin and Richard M. Karp in 1987, is one of the first fingerprinting algorithms that uses k-grams. The algorithm tries to find occurrences of a string str1 of length k in a string str2, whose length must be larger than k. To do so, the algorithm derives all k-grams from str2 and hashes them as well as str1. Then, the hashes of all k-grams derived from str2 are compared with the hash of str1. The purpose of hashing is to represent each string by a number, so that numbers can be compared instead of strings.

Hashing strings of length k is expensive for large k, since hash values for all substrings of length k of str2 need to be calculated. Therefore, a rolling hash function is suggested, which calculates the hash values without rehashing the whole string: the hash of the ith k-gram is computed from the hash of the (i − 1)th k-gram in constant time.

A k-gram $c_1 \dots c_k$ is treated as a k-digit number in some base b, with the following hash [6]:

\[ H(c_1 \dots c_k) = c_1 \cdot b^{k-1} + c_2 \cdot b^{k-2} + \dots + c_{k-1} \cdot b + c_k \]

The hash of the next k-gram is then [6]:

\[ H(c_2 \dots c_{k+1}) = \left(H(c_1 \dots c_k) - c_1 \cdot b^{k-1}\right) \cdot b + c_{k+1} \]

Because $b^{k-1}$ is a constant, $H(c_2 \dots c_{k+1})$ can be computed from $H(c_1 \dots c_k)$ with only two additions and two multiplications. A weakness of this hash function is that the values of the $c_i$ are usually small integers, so the addition step affects only the low-order bits of the hash. Ideally, each $c_i$ should affect all bits of the hash. This problem can be fixed with two changes [6]: first, the entire hash of the first k-gram is multiplied by an additional factor b; second, the order of multiplication and addition in the incremental step is switched. Altogether, $H'$ is defined as [6]:

\[ H'(c_2 \dots c_{k+1}) = \left(\left(H'(c_1 \dots c_k) - c_1 \cdot b^{k}\right) + c_{k+1}\right) \cdot b \]
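The incremental update can be checked with a short Python sketch. It implements the first variant H and compares the rolling result with hashing each k-gram from scratch. The base 257, the modulus 2^61 − 1, and all function names are assumptions made for this illustration; the thesis does not prescribe them, and the reduction modulo a fixed number is added here only to keep the values bounded.

```python
BASE = 257            # assumed base b
MOD = (1 << 61) - 1   # assumed modulus, keeps hash values bounded


def direct_hash(gram: str, base: int = BASE, mod: int = MOD) -> int:
    """H(c1...ck) = c1*b^(k-1) + ... + ck, evaluated with Horner's rule."""
    h = 0
    for ch in gram:
        h = (h * base + ord(ch)) % mod
    return h


def rolling_hashes(text: str, k: int, base: int = BASE, mod: int = MOD) -> list:
    """Hash every k-gram of text, deriving each hash from its predecessor."""
    if k <= 0 or len(text) < k:
        return []
    top = pow(base, k - 1, mod)            # b^(k-1), computed once
    h = direct_hash(text[:k], base, mod)
    hashes = [h]
    for i in range(1, len(text) - k + 1):
        outgoing = ord(text[i - 1])        # c1 of the previous k-gram
        incoming = ord(text[i + k - 1])    # c(k+1), the newly covered character
        # H(c2..ck+1) = (H(c1..ck) - c1*b^(k-1)) * b + c(k+1)
        h = ((h - outgoing * top) * base + incoming) % mod
        hashes.append(h)
    return hashes


text = "helloiamfoandthisisfa"
rolled = rolling_hashes(text, 5)
direct = [direct_hash(text[i:i + 5]) for i in range(len(text) - 5 + 1)]
print(rolled == direct)
```

Each step of the loop performs a constant amount of work, independent of k, whereas hashing every k-gram from scratch costs k operations per k-gram.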

2.4. All-To-All Matching

All-to-all matching [4] was developed by Manber and is the first fingerprinting-based algorithm for collections of documents. The algorithm is used to find similar files in file systems. Unlike the Rabin-Karp algorithm, it compares all k-grams in a collection of documents instead of searching for a single string in a document.

Figure 2.2 shows the individual steps of the algorithm. Removing all irrelevant features from the text in Figure 2.2(a) yields Figure 2.2(b). Figure 2.2(c) shows all k-grams derived from Figure 2.2(b), and Figure 2.2(d) shows the hashes of these k-grams.


This is an example for all-to-all matching
(a) Sample text

thisisanexampleforalltoallmatching
(b) Canonical form of the text

thisis hisisa isisan sisane isanex sanexa anexam nexamp exampl xample
amplef mplefo plefor lefora eforal forall orallt rallto alltoa lltoal
ltoall toallm oallma allmat llmatc lmatch matchi atchin tching
(c) 6-grams derived from the text

77 74 42 17 98 50 17 98 08 88 67 39 77 74 42 17 98 29 33 12 66 45 34 98 32 63 24 53 75
(d) Hypothetical hashes of the 6-grams

Figure 2.2.: All-To-All matching example
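The step from a sample text to its canonical form, as in Figure 2.2(a) and (b), can be sketched as follows. The function name canonical_form is hypothetical, and keeping only letters and digits is an assumption that suits plain text; the tokenization of program code in Chapter 4 is more involved. The sample sentence is chosen so that it produces the canonical form of Figure 2.2(b).

```python
def canonical_form(text: str) -> str:
    """Drop everything except letters and digits and lower-case the rest."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


sample = "This is an example for all-to-all matching"
canon = canonical_form(sample)
six_grams = [canon[i:i + 6] for i in range(len(canon) - 6 + 1)]

print(canon)
print(len(canon), len(six_grams))  # 34 characters give 34 - 6 + 1 = 29 6-grams
```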

In theory, one could now compare all hashes of the documents, but in practice the computational cost is too high, because the set of hashes becomes very large for bigger documents. Therefore, it is necessary to select a subset of the hashes to represent each document.

One approach is to select every ith hash of a document as a fingerprint. However, this strategy does not satisfy the desirable property of position independence: insertion, deletion, and reordering have a big effect on the set of matches. For example, removing one character at the beginning of the document shifts the positions of all k-grams by one. As a result, the modified copy may share none of its fingerprints with the original document. Thus, this strategy is unsuitable, because any effective fingerprint-based copy-detection algorithm needs to be independent of the position of the fingerprints within the document.

Another strategy, used by Manber, is to select the hashes that are 0 mod p. In this case, position independence is satisfied, because a hash is chosen based on its value alone: if two documents share a hash that is 0 mod p, it is selected in both, regardless of its position. A disadvantage of this strategy is that the algorithm gives no guarantee of finding matches. Define the gap as the distance between consecutively selected fingerprints. On average every pth hash is selected, but the gap is not bounded, and any match that lies entirely inside a gap is not detected, because only hashes that are 0 mod p are selected. To solve this problem, Chapter 3 introduces the winnowing algorithm, an efficient algorithm for selecting fingerprints. The main advantage of winnowing is that the maximum gap between consecutive selected fingerprints is bounded.
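The difference between the two selection strategies can be illustrated with the first 17 hypothetical hashes of Figure 2.2(d). In the following sketch, the function names and the choice i = p = 4 are assumptions for illustration only. Dropping the first hash stands in for a deletion at the beginning of the document.

```python
def every_ith(hashes, i):
    """Select the hashes at positions 0, i, 2i, ... (position dependent)."""
    return hashes[::i]


def zero_mod_p(hashes, p):
    """Select the hashes that are divisible by p (position independent)."""
    return [h for h in hashes if h % p == 0]


original = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
shifted = original[1:]  # the same document without its first k-gram

# Every 4th hash: the two selections have nothing in common.
print(every_ith(original, 4), every_ith(shifted, 4))

# 0 mod 4: both versions select the same fingerprints.
print(zero_mod_p(original, 4), zero_mod_p(shifted, 4))

# Positions of the 0 mod 4 fingerprints in the original document.
print([pos for pos, h in enumerate(original) if h % 4 == 0])
```

The last line prints positions 8 and 9 only, so the first eight and the last seven hashes contain no fingerprint at all. This shows that a gap can be considerably longer than p.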


Figure 2.3.: Structure of Java class file

2.5. Java Bytecode

Java bytecode is the instruction set of the Java Virtual Machine (JVM), which operates on an operand stack. There are 203 instructions, each consisting of a one-byte opcode followed by zero or more operands. Java bytecode is generated from Java source code by a Java compiler, stored in a class file, and then executed by the JVM.

The class file consists of the following sections: header, constant pool, field table, method table, and attribute table. The first important piece of information in the header is the magic number, which identifies the class file format. It is followed by the version information and the constant pool, which contains all information about the constants used in the class. After that come the field and method tables, which contain the information about variables and methods. Finally, the attribute table is located at the end of the class file; it can be used to debug the Java program.

The structure of a Java class file is shown in Figure 2.3, which is taken from a paper by Jeong-Hoon Ji [2]. The actual bytecode is located in the code attribute of each method table. For every user-defined method in the source code, a method table is generated in the class file, along with one for a default constructor.

For instance, Figure 2.4 contains a Java class HelloWorld, which consists only of a main method. As a result, the class file HelloWorld.class contains two method tables: one for the default constructor and one for the main method. Figure 2.5 shows the bytecode of the class HelloWorld. It contains two methods: the main method and the default constructor, which is generated by the compiler.

    public class HelloWorld
    {
        public static void main(String[] args)
        {
            // Output
            System.out.println("Hello World");
        }
    }

Figure 2.4.: Java Hello World example

    //Compiled from "HelloWorld.java"
    public class HelloWorld extends java.lang.Object{
    public HelloWorld();
      Code:
       0: aload_0
       1: invokespecial #1; //Method java/lang/Object."<init>":()V
       4: return

    public static void main(java.lang.String[]);
      Code:
       0: getstatic #2; //Field java/lang/System.out:Ljava/io/PrintStream;
       3: ldc #3; //String Hello World
       5: invokevirtual #4; //Method java/io/PrintStream.println:(Ljava/lang/String;)V
       8: return
    }

Figure 2.5.: Java bytecode Hello World example

Below each method in Figure 2.5, there is a sequence of bytecode instructions. For every method in a class file, a corresponding bytecode array exists. The numbers in front of each instruction are the indices in this array at which the opcode and its operands are stored. An opcode is one byte long, and an instruction can have several operand bytes; therefore, the numbers are not consecutive. Finally, the numbers after the instructions (#1, #2, ...) refer to the corresponding entries in the constant pool.

The bytecode of each method stored in the method tables is important for our implementation, which is presented in Chapter 4. This code is used to find similarities between class files.
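The beginning of the class file header can be inspected with a few lines of Python. The sketch below reads the magic number and the version information from the first eight bytes. The function name read_class_header is hypothetical, and the header bytes are written by hand for illustration rather than taken from the compiled HelloWorld class. It relies on the class file format storing the magic number 0xCAFEBABE first, followed by the minor and major version as big-endian 16-bit values.

```python
import struct


def read_class_header(data: bytes):
    """Return (magic, minor_version, major_version) from the first 8 bytes."""
    if len(data) < 8:
        raise ValueError("a class file header needs at least 8 bytes")
    # Big-endian layout: u4 magic, u2 minor_version, u2 major_version.
    magic, minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return magic, minor, major


# Hand-written example header; major version 52 corresponds to Java 8.
header = bytes.fromhex("cafebabe00000034")
magic, minor, major = read_class_header(header)
print(hex(magic), minor, major)
```

For a real class file, the same function could be applied to the first bytes read from HelloWorld.class.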


3. Winnowing Algorithm

In this chapter, we describe the winnowing algorithm for selecting fingerprints from hashes of k-grams. The important steps of the algorithm are explained with an example, and code for an efficient implementation of winnowing is presented. Finally, we prove the expected density of winnowing [7].

3.1. Parameter

The winnowing algorithm is used by the copy-detection tool MOSS and was introduced by Schleimer, Wilkerson, and Aiken in 2003 [6]. The algorithm selects fingerprints for each document and compares them. Given a set of documents to be compared, the following two properties should be satisfied:

1. Any match at least as long as the guarantee threshold t is detected.

2. Any match shorter than the noise threshold k is not detected.

The constants t and k are chosen by the user, where k ≤ t. The second property is satisfied by considering only hashes of k-grams, which rules out any match shorter than k. A bigger k increases the probability that matches found between documents are not coincidental. However, a bigger k also limits the sensitivity to the reordering of substrings, because the relocation of a substring shorter than k cannot be detected. As a result, k should be chosen as the minimum value that eliminates as many coincidental matches as possible. The first property is satisfied by selecting at least one fingerprint from every window of w = t − k + 1 consecutive hashes [6].

3.2. Example

Figure 3.1 shows the important steps of the winnowing algorithm. First, all irrelevant features are removed from the text in Figure 3.1(a). This results in Figure 3.1(b), which contains no whitespace, punctuation, or capital letters. Figure 3.1(c) shows all 5-grams of the string in Figure 3.1(b), and Figure 3.1(d) contains the hashes of these 5-grams. Afterward, windows are built from the hashes, as shown in Figure 3.1(e). In this example the window length is w = 4, which together with k = 5 corresponds to a guarantee threshold of t = w + k − 1 = 8. Fingerprints are selected from the windows using the following strategy [6]:

Strategy 1 (WINNOWING). In every window, the minimum hash value is selected. If a window contains more than one minimum value, the rightmost occurrence is selected. Every selected hash is saved as a fingerprint, and every hash occurrence is saved only once: if the minimum of a window has already been selected as a fingerprint in an earlier window, nothing new is saved for this window.


Hello! I am Fo and this is Fa.
(a) Sample text

helloiamfoandthisisfa
(b) Canonical form of the text

hello elloi lloia loiam oiamf iamfo amfoa mfoan foand
oandt andth ndthi dthis thisi hisis isisf sisfa
(c) 5-grams derived from the text

77 74 42 17 98 50 15 98 08 88 67 39 77 74 42 17 98
(d) Hypothetical hashes of the 5-grams

(77, 74, 42, 17) (74, 42, 17, 98) (42, 17, 98, 50)
(17, 98, 50, 15) (98, 50, 15, 98) (50, 15, 98, 08)
(15, 98, 08, 88) (98, 08, 88, 67) (08, 88, 67, 39)
(88, 67, 39, 77) (67, 39, 77, 74) (39, 77, 74, 42)
(77, 74, 42, 17) (74, 42, 17, 98)
(e) Windows of hashes of length 4

17 15 08 39 17
(f) Fingerprints selected by winnowing

[17,3] [15,6] [08,8] [39,11] [17,15]
(g) Fingerprints paired with 0-based positional information

Figure 3.1.: Winnowing algorithm example

Each selected hash is shown in bold in Figure 3.1(e). With high probability, a selected hash is still the minimum in the adjacent windows, because the minimum of w random hashes is likely to be smaller than one additional hash. This is the reason for choosing the minimum value of a window. Consequently, the number of selected hashes is far smaller than the number of windows, while the selected hashes still represent the document.

The set of selected fingerprints is shown in Figure 3.1(f). A useful feature of copy-detection tools is to record the position of each fingerprint in the document, so that the matching substring can be shown in a user interface. Figure 3.1(g) shows the fingerprints of this chapter's example paired with their positions.
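Strategy 1 can be written down directly as a short Python sketch. It is a straightforward rendering of the definition rather than the efficient implementation of Section 3.3, and the function name winnow is chosen for this illustration. Applied to the hashes of Figure 3.1(d) with w = 4, it returns the fingerprints of Figure 3.1(g).

```python
def winnow(hashes, w):
    """Return the (hash, position) fingerprints chosen with window size w."""
    fingerprints = []
    last_pos = -1  # position of the most recently saved fingerprint
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Rightmost occurrence of the minimum inside this window.
        offset = max(i for i, h in enumerate(window) if h == smallest)
        pos = start + offset
        # Selected positions never move left, so comparing with the last
        # saved position is enough to save every occurrence only once.
        if pos != last_pos:
            fingerprints.append((smallest, pos))
            last_pos = pos
    return fingerprints


hashes = [77, 74, 42, 17, 98, 50, 15, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(winnow(hashes, 4))
# [(17, 3), (15, 6), (8, 8), (39, 11), (17, 15)]
```

Only 5 of the 17 hashes are kept, and no two consecutive fingerprints are more than w = 4 positions apart.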

3.3. Efficient Implementation of Winnowing

Figure 3.2 depicts code for an efficient implementation of the main winnowing loop. This code was provided in the paper written by Saul Schleimer [6].

```c
#include <limits.h>   /* INT_MAX */
#include <stdbool.h>  /* true */

/* hash_t, next_hash(), record() and global_pos() are assumed to be
   provided by the surrounding program. */

void winnow(int w) { /* window size */
    // circular buffer implementing window of size w
    hash_t h[w];
    for (int i = 0; i < w; ++i) h[i] = INT_MAX;
    int r = 0;   // window right end
    int min = 0; // index of minimum hash
    // At the end of each iteration, min holds the
    // position of the rightmost minimal hash in the
    // current window. record(x) is called only the
    // first time an instance of x is selected as the
    // rightmost minimal hash of a window.
    while (true) {
        r = (r + 1) % w;    // shift the window by one
        h[r] = next_hash(); // and add one new hash
        if (min == r) {
            // The previous minimum is no longer in this
            // window. Scan h leftward starting from r
            // for the rightmost minimal hash. Note: min
            // starts with the index of the rightmost
            // hash. (r - 1 + w) % w avoids a negative
            // index when r == 0.
            for (int i = (r - 1 + w) % w; i != r; i = (i - 1 + w) % w)
                if (h[i] < h[min]) min = i;
            record(h[min], global_pos(min, r, w));
        } else {
            // Otherwise, the previous minimum is still in
            // this window. Compare against the new value
            // and update min if necessary.
            if (h[r] <= h[min]) {
                min = r;
                record(h[min], global_pos(min, r, w));
            }
        }
    }
}
```

Figure 3.2.: Code for winnowing main loop

The implementation in Figure 3.2 exploits the fact that the minimum value from a preceding window is usually still within the current window. For this case, the else branch needs only a single comparison, h[r] <= h[min]. The other case, min == r, is that the minimum value of the preceding window is no longer in the current window. In that case, the minimum hash has to be computed by traversing the entire window from right to left to get the rightmost occurrence of the minimum hash. A fingerprint is created by saving the position together with the selected hash in the call to record.
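For readers who prefer Java, the language of the implementation in Chapter 4, the same loop can be sketched as follows. This is a hedged port and not the code from [6]: the class name WinnowSketch, the finite input array in place of next_hash(), the pos array in place of global_pos(), and the returned list in place of record() are assumptions made for illustration. Unlike the C code, the sketch only records a position once the first full window has been read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WinnowSketch {
    /** Returns the positions of the hashes selected as fingerprints. */
    static List<Integer> winnow(int[] hashes, int w) {
        List<Integer> selected = new ArrayList<>();
        int[] h = new int[w];    // circular buffer holding the current window
        int[] pos = new int[w];  // position in hashes[] of each buffer slot
        Arrays.fill(h, Integer.MAX_VALUE);
        int r = 0;               // window right end
        int min = 0;             // buffer index of the rightmost minimal hash
        int last = -1;           // position recorded most recently
        for (int p = 0; p < hashes.length; p++) {
            r = (r + 1) % w;     // shift the window by one
            h[r] = hashes[p];    // and add one new hash
            pos[r] = p;
            if (min == r) {
                // The previous minimum was overwritten: rescan leftward
                // for the rightmost minimal hash.
                for (int i = (r - 1 + w) % w; i != r; i = (i - 1 + w) % w) {
                    if (h[i] < h[min]) min = i;
                }
            } else if (h[r] <= h[min]) {
                // The previous minimum is still in the window:
                // a single comparison with the new hash suffices.
                min = r;
            }
            // Record only for full windows, and each position only once.
            if (p >= w - 1 && pos[min] != last) {
                last = pos[min];
                selected.add(last);
            }
        }
        return selected;
    }
}
```

For the illustrative hash sequence 77 74 42 17 98 50 17 98 8 88 67 39 77 74 42 17 98 and w = 4, the sketch returns the positions 3, 6, 8, 11 and 15, which hold the hashes 17, 17, 8, 39 and 17.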


3.4. Expected Density

The density of a fingerprinting algorithm is the expected proportion of hashes selected as fingerprints. The density of the winnowing algorithm depends on w, the window size. Saul Schleimer's paper [6] showed that the expected density is $d = \frac{2}{w+1}$.

The All-to-All matching algorithm with the 0 mod p approach for hash selection, which was presented in Section 2.4, has an expected density of $\frac{1 + \ln(w)}{w}$ [6]. This is considerably more than that of the winnowing algorithm.
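To give a feeling for the difference, both expressions can be evaluated for one window size. The value w = 20 is chosen here only as an illustration; it is the source code window size of the first test run in Chapter 5.

```latex
\[
  d_{\text{winnowing}} = \frac{2}{w+1} = \frac{2}{21} \approx 0.0952,
  \qquad
  d_{0 \bmod p} = \frac{1 + \ln(w)}{w} = \frac{1 + \ln(20)}{20} \approx 0.1998 .
\]
```

For this window size, winnowing is therefore expected to select roughly half as many hashes.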

Lemma 1. The expected density of winnowing is $d = \frac{2}{w+1}$, given that the hashes are independently and uniformly distributed and the probability of a tie for the minimum value in any small window is small enough to be ignored.

Proof. Let us define the function C that maps the position of a window (the position of its leftmost hash) to the position of the fingerprint which that window selects. The function C is monotonically increasing: if i and j are the positions of two windows with $i < j$, then $C(i) \le C(j)$. To prove this claim, two cases must be considered [7]:

1. Windows $W_i$ and $W_j$ do not overlap. Then the position of every hash in $W_j$ is bigger than $C(i)$. Thus, we have $C(i) < C(j)$.

2. Windows $W_i = (h_i, h_{i+1}, \dots, h_j, \dots, h_{i+w-1})$ and $W_j = (h_j, h_{j+1}, \dots, h_{i+w-1}, \dots, h_{j+w-1})$, where $i < j \le i + w - 1$, overlap. Let q be the position of the minimum hash among $h_j, \dots, h_{i+w-1}$. Then $C(i)$ is at most q, and $C(j)$ is at least q, because both windows contain the overlapping hashes. Thus, we have $C(i) \le C(j)$.

These two cases show that the function C is monotonically increasing. Now, we define a random variable $X_i$ that is 1 if $W_i$ selects a fingerprint which is not selected by any previous window, and 0 otherwise. Consider $W_i$ and $W_{i-1}$, which overlap on all hashes except the leftmost hash $h_{i-1}$ and the rightmost hash $h_{i+w-1}$. $W_i \cup W_{i-1}$ is an interval of length $w + 1$.

Let p be the position of the smallest hash in this interval. As a result, we have the following three cases:

1. If $p = i - 1$, then $W_i$ must select a hash at a different position q with $q > p$, because $W_{i-1}$ selects p and $p \notin W_i$. Because of the monotonicity of the function C, $W_i$ is the first window to select q. Therefore, we have $X_i = 1$.

2. If $p = i + w - 1$, then it is selected by $W_i$ because $W_i$ is the first window containing p. Thus, $X_i = 1$.

3. If $i - 1 < p < i + w - 1$, then both $W_{i-1}$ and $W_i$ select p. Thus, $W_i$ is not the first window to select p and $X_i = 0$.

Because the hashes are independently and uniformly distributed, each of the $w + 1$ positions in the interval is equally likely to hold the smallest hash, so cases one and two each occur with probability $\frac{1}{w+1}$. As a result, we obtain $\frac{2}{w+1}$ as the expected value of the random variable $X_i$.


The expected density gives programmers a reference value for the measured density of their implementation. If there is a big difference between the measured and the expected density, the programmer should revise the implementation, the hash function or the test data. In Chapter 5, we will use the expected density to evaluate the test runs.
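Such a comparison can be sketched in a few lines of Java. The following block is only an illustration under stated assumptions: the class name DensitySketch, the method names, and the use of java.util.Random as a stand-in for a uniformly distributed hash function are not part of the implementation described in this thesis. The measured density is taken to be the number of selected fingerprints divided by the number of hashes.

```java
import java.util.Random;

class DensitySketch {
    /** Expected density of winnowing for window size w (Lemma 1). */
    static double expectedDensity(int w) {
        return 2.0 / (w + 1);
    }

    /** Selects the rightmost minimum of every window and returns selected / hashes. */
    static double measuredDensity(int[] hashes, int w) {
        int selected = 0;
        int lastPos = -1; // position of the most recently selected hash
        for (int x = 0; x + w <= hashes.length; x++) {
            int minPos = x;
            for (int j = x + 1; j < x + w; j++) {
                if (hashes[j] <= hashes[minPos]) minPos = j; // <= keeps the rightmost minimum
            }
            if (minPos != lastPos) {
                selected++;
                lastPos = minPos;
            }
        }
        return (double) selected / hashes.length;
    }

    /** Hypothetical test data: n pseudo-random hash values. */
    static int[] randomHashes(int n, long seed) {
        Random rnd = new Random(seed);
        int[] h = new int[n];
        for (int i = 0; i < n; i++) h[i] = rnd.nextInt();
        return h;
    }
}
```

Calling measuredDensity(randomHashes(200000, 42L), 20) and comparing the result with expectedDensity(20) should give two values that differ only slightly; a large gap would point to a problem in the window logic or in the hash function.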


4. Implementation

This chapter presents decisions made regarding the structure and implementation of our copy-detection tool. First, the different requirements for Java source code and bytecode will be named and then the structure of our implemented tool is presented. Finally, the workflow of this implementation will be shown and explained.

Our implementation handles source code and bytecode differently in some steps. These differences will be shown in detail by explaining every step of the workflow. Our motivation is to have a copy-detection tool based on winnowing with the ability to find similarities between a set of source code files or binary code files.

An important fact about this tool is that found matches do not imply that plagiarism has occurred. They only imply that similarities exist, and then it is up to the user to check manually if the found matches represent cases of plagiarism.

4.1. Requirements and Restrictions

Our implementation is based on winnowing and hence it must satisfy the desirable properties presented in Chapter 2. Recall the desirable properties: insensitivity to irrelevant features like comments and whitespace, independence of the found matches from the positions in the text, and finding only matches larger than a defined noise threshold.

As mentioned in Chapter 3, a useful feature for a copy-detection tool is a user interface which shows the found matches. This feature is also one of our requirements because in the end the user manually checks if plagiarism was detected, and therefore a simple way is needed to see the found matches. Another requirement is that the comparison of source code and bytecode should be considered independently, with the only exception being that we compile the source code to get the corresponding bytecode. Thus, we define the restriction that source code files must be compilable to generate bytecode. Furthermore, the Java code needs to be restricted to the Java Standard Library, due to the tokenization step which will be explained later in detail.

4.2. Structure

The requirement of independence between source code and bytecode has an influence on the decisions made regarding the overall structure. This implementation has two independent comparison analyses: one for source code, the other for bytecode. Both save the results separately, and then the results are analyzed to decide if plagiarism was found.

The described course of action is pictured in Figure 4.1. In the representation, we have two independent objects named Source Code and Bytecode, which run through the process of comparison. The comparison process is defined by our workflow that will be presented in the next section. After matches are found, the copy-detection tool generates output with the detected similarities for the user. Still, we distinguish between source code and bytecode. Thus, we obtain independent outputs, which are both analyzed by the user to make a decision on plagiarism. At this point, there is no longer a distinction between the types of code, as the user uses information from both the bytecode and the source code comparison to make a decision.

Figure 4.1.: Structure of the copy-detection tool

4.3. Workflow

The workflow is defined by eight steps, shown in Figure 4.1. These are relatively similar to the individual steps of the winnowing example shown in Figure 3.1. First, the input is read, which includes the program code files and the parameters w for the window size and k for the k-gram length. Different parameters are defined for source code and bytecode.

The winnowing algorithm for source code has $w_{sc}$ and $k_{sc}$ as parameters, while the one for bytecode has the parameters $w_{bc}$ and $k_{bc}$, where $k_{sc} < k_{bc}$, because a single Java source code line normally generates multiple bytecode instructions, as can be seen in Figure 2.5. Then, the program code files are tokenized and k-grams are derived from them. With the hash values of all k-grams, windows are built and the fingerprints are selected after that. The selected hashes from the files are compared, and output in the form of HTML pages is then generated.

The general steps of handling program code are shown in Figure 4.2. The more interesting aspects are the differences between handling source code and bytecode. These will be presented in detail for the individual steps in the following.

Figure 4.2.: Workflow of the implementation


4.3.1. Input

First, the parameters $w_{bc}$, $k_{bc}$, $w_{sc}$, and $k_{sc}$ are read. The k-gram length for source code needs to be at least 25 because most of the widely used Java keywords and identifiers in the Java Standard Library are shorter than 25 characters. This value was chosen as a minimum value to reduce coincidental matches. Therefore, the tool throws an exception if $k_{bc} \le k_{sc}$ or $k_{sc} \le 25$. The window length should not be bigger than the k-gram length, and for small values of $w_{sc}$ and $w_{bc}$ our implementation would select too many hashes as fingerprints. As a result, the false-positive rate would increase. Hence, an exception is thrown if $w_{bc} < 15$ or $w_{sc} < 15$. Next, the Java files in the input directory, which holds all source code files that are to be compared, are compiled. Through this, the corresponding class files are obtained and the bytecode can be extracted. This step contains the only dependency between source code and bytecode: without compiling the Java files, bytecode cannot be obtained. Therefore, compilable source code is an essential requirement.
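The parameter checks described above could look roughly like the following sketch. The class name ParameterCheckSketch, the method signature and the choice of IllegalArgumentException are assumptions for illustration; the text only states that an exception is thrown under these conditions.

```java
class ParameterCheckSketch {
    /**
     * Rejects parameter combinations that the text describes as invalid.
     * wSc/kSc: window and k-gram length for source code,
     * wBc/kBc: window and k-gram length for bytecode.
     */
    static void validate(int wSc, int kSc, int wBc, int kBc) {
        if (kSc <= 25) {
            throw new IllegalArgumentException("k for source code must be larger than 25");
        }
        if (kBc <= kSc) {
            throw new IllegalArgumentException("k for bytecode must be larger than k for source code");
        }
        if (wSc < 15 || wBc < 15) {
            throw new IllegalArgumentException("window sizes must be at least 15");
        }
    }
}
```

With the values of the first test run in Chapter 5, validate(20, 30, 60, 200) passes, whereas validate(20, 25, 60, 200) is rejected because of the k-gram length for source code.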

4.3.2. Tokenization

The second step is responsible for removing irrelevant features from the input. Here, we defined different rules for source code and bytecode tokenization. Source code tokenization has the following rules:

• Rename all variable identifiers to Y

• Change string values to X

• Delete whitespace, comments and punctuation

```java
public class HelloWorld
{
    public static void main(String[] args)
    {
        // Output
        System.out.println("Hello World");
    }
}
```

Result after source code tokenization of class HelloWorld :

publicclassYpublicstaticvoidmainstring[]YsystemoutprintlnX

Figure 4.3.: Source code tokenization
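The three source code rules can be approximated with a small regular-expression based sketch. This is not the ANTLR-based tokenizer of the implementation: the class name SourceTokenizerSketch, the regular expression and the tiny KNOWN set, which stands in for the file of Java keywords and Java Standard Library identifiers, are all assumptions made for illustration. The sketch also does not reproduce every detail of the result in Figure 4.3, for example it keeps the original capitalization and drops the brackets of String[].

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SourceTokenizerSketch {
    // Hypothetical stand-in for the keyword and Java Standard Library identifier file.
    private static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
            "public", "class", "static", "void", "main",
            "String", "System", "out", "println"));

    // Alternatives in order: line comment, block comment, string literal, identifier.
    private static final Pattern TOKEN = Pattern.compile(
            "//[^\\n]*|/\\*.*?\\*/|\"(?:\\\\.|[^\"\\\\])*\"|[A-Za-z_$][A-Za-z0-9_$]*",
            Pattern.DOTALL);

    static String tokenize(String source) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            String t = m.group();
            if (t.startsWith("//") || t.startsWith("/*")) {
                continue;                             // delete comments
            }
            if (t.startsWith("\"")) {
                out.append('X');                      // change string values to X
            } else {
                out.append(KNOWN.contains(t) ? t : "Y"); // rename unknown identifiers to Y
            }
        }
        // Whitespace and punctuation never match TOKEN and are therefore dropped.
        return out.toString();
    }
}
```

Applied to the HelloWorld class above, this sketch yields publicclassYpublicstaticvoidmainStringYSystemoutprintlnX.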

In Figure 4.3, the rules are applied to the source code example. Initially, variables are renamed and string values exchanged. The difficulty of renaming variables lies in distinguishing the user-defined variable identifiers from basic function identifiers, for example System.out.println. Renaming all identifiers would have negative consequences, such as an increased false-positive rate. Therefore, a file with all class and function keywords from the Java Standard Library (JSL) is created to check if an identifier belongs to the JSL. Thus, Java files need to be restricted to the Java Standard Library. Renamed variables and changed string values are printed in bold in Figure 4.3. For bytecode tokenization the following rules were defined:

• Delete comments, whitespace and the number in front of instructions that refers to the index of the corresponding array (see Section 2.5)

• Replace indexes of the constant pool with their value

• Rename variables to Y

• Change string values to X

```
//Compiled from "HelloWorld.java"
public class HelloWorld extends java.lang.Object{
public HelloWorld();
  Code:
   0: aload_0
   1: invokespecial #1; //Method java/lang/Object."<init>":()V
   4: return

public static void main(java.lang.String[]);
  Code:
   0: getstatic #2; //Field java/lang/System.out:Ljava/io/PrintStream;
   3: ldc #3; //String Hello World
   5: invokevirtual #4; //Method java/io/PrintStream.println:(Ljava/lang/String;)V
   8: return
}
```

Result after bytecode tokenization of class HelloWorld :

publicclassYpublicY()aloadinvokespecialobject.<init>:()Vreturnpublicstaticvoidmain(String[])getstaticSystem.out:PrintStreamldcXinvokevirtualprintln:(String)Vreturn

Figure 4.4.: Bytecode tokenization

Applying these tokenization rules takes more effort. Bytecode contains more irrelevant features and information that can be removed. The method information, the sequence of instructions and the corresponding parameters from the constant pool are important, as we can see in Figure 4.4.


In the example in Figure 4.4, all irrelevant features are removed, variable identifiers and string values are changed, which is shown in bold, and the constant pool value replaces the reference after an instruction. Similarly to replacing variable names in source code tokenization, it is checked if the identifier is part of the Java Standard Library. The rule is only applied if this is not the case. The remaining steps of the workflow do not differentiate between the types of code. They have the same sequence of actions for source code and bytecode.
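How a single disassembled instruction line might be normalized is sketched below. The sketch covers only three of the rules (dropping the leading index, replacing the constant pool reference with its value, and changing string values to X); it does not shorten class names or rename identifiers, so its output is longer than the result in Figure 4.4. The class name BytecodeLineSketch and both regular expressions are assumptions for illustration, not the ANTLR grammar of the implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BytecodeLineSketch {
    // "<index>: <opcode> <rest of line>"
    private static final Pattern INSTR =
            Pattern.compile("^\\s*\\d+:\\s*(\\S+)\\s*(.*)$");
    // "#<pool index>; //<kind> <resolved value>"
    private static final Pattern POOL_REF =
            Pattern.compile("^#\\d+[;,]?\\s*//\\s*(\\w+)\\s+(.*)$");

    static String normalize(String line) {
        Matcher instr = INSTR.matcher(line);
        if (!instr.matches()) {
            return line.replaceAll("\\s+", "");  // not an instruction line
        }
        String opcode = instr.group(1);
        String operand = instr.group(2).trim();
        Matcher ref = POOL_REF.matcher(operand);
        if (ref.matches()) {
            // Replace the constant pool index with its value; strings become X.
            operand = ref.group(1).equals("String") ? "X" : ref.group(2);
        }
        return (opcode + operand).replaceAll("\\s+", "");
    }
}
```

For the line "3: ldc #3; //String Hello World" the sketch produces ldcX, and for "0: aload_0" it produces aload_0.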

4.3.3. K-Gram and Hashing

```java
// Class KGram saves the starting and ending position
// of the k-gram in the program code file
public class KGram {
    private final String str;
    private final int hash;
    private final int start;
    private final int end;

    // Constructor
    public KGram(String str, int start, int end) {
        this.str = str;
        // hashing the string
        this.hash = str.hashCode();
        this.start = start;
        this.end = end;
    }

    // Getter functions
    public String getStr() {
        return str;
    }

    public int getHash() {
        return hash;
    }

    public int getStart() {
        return start;
    }

    public int getEnd() {
        return end;
    }
}
```

Figure 4.5.: Class KGram

Figure 4.5 shows the implemented KGram class, which also directly performs the hashing step. For every k-gram an object of this class is created, in which the hash, the k-gram value, and its start and end position in the program code file are saved. The position information is mandatory so that we are able to map the k-gram to its location in the original file. Thus, this implementation can mark the found matches in the output for the user. Hashing the k-gram is done with the Java default hashCode function.
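Deriving the k-grams themselves amounts to sliding a window of k characters over the token string. The following sketch illustrates this under assumptions: the class KGramSketch and its nested Entry type are hypothetical, the end index is taken to be inclusive, and the positions refer to the tokenized string rather than to the original file, which the real tool additionally has to map back.

```java
import java.util.ArrayList;
import java.util.List;

class KGramSketch {
    /** One k-gram: its hash plus inclusive start and end index in the token string. */
    static final class Entry {
        final int hash;
        final int start;
        final int end;

        Entry(int hash, int start, int end) {
            this.hash = hash;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns all overlapping substrings of length k, hashed with String.hashCode(). */
    static List<Entry> kGrams(String tokens, int k) {
        List<Entry> result = new ArrayList<>();
        for (int start = 0; start + k <= tokens.length(); start++) {
            String gram = tokens.substring(start, start + k);
            result.add(new Entry(gram.hashCode(), start, start + k - 1));
        }
        return result;
    }
}
```

A token string of length n yields n - k + 1 k-grams, so the 12 characters of publicclassY produce 8 k-grams for k = 5.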

4.3.4. Build Windows and Select Fingerprints

```java
/**
 * Code snippet to build windows and select fingerprints by
 * calling a function to get the minimum of an int array.
 * hashList contains all k-gram hash values.
 */
// Variable saves index of the last selected fingerprint
// (-1 means that no fingerprint has been selected yet)
int minLocation = -1;
// Variable saves the index of minimum hash in the current window
int location = 0;
// Loop for defining the starting position of a window with variable x.
// The last window starts at hashList.size() - windowSize.
for (int x = 0; x <= hashList.size() - windowSize; x++) {
    // Loop builds the window by taking the starting position x
    // and saves in every iteration the hash value in windowArray
    for (int j = 0, k = x; j < windowSize; j++, k++) {
        windowArray[j] = hashList.get(k);
    }
    // Calls a function to find the minimum hash
    location = x + windowClass.indexOfMin(windowArray);
    // Checks if minimum hash was selected before as a fingerprint
    // If not, select hash as a fingerprint
    if (location != minLocation) {
        // Saves hash as a fingerprint
        fingerprintList.add(getMinValue(windowArray));
        minLocation = location;
    }
}
```

Figure 4.6.: Code snippet for building windows and selecting fingerprints

Figure 4.6 contains the code for building windows and selecting fingerprints. For that, it uses the functions shown in the Appendix Functions, which select the minimum value and the corresponding index from an integer array. The course of actions in Figure 4.6 is first to build a window, select the minimum value and save it as a fingerprint unless it was already chosen before. Then, the next window is built and this sequence of actions is repeated until the last window is reached.
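The two helper functions are not reproduced in this chapter. A minimal sketch of what they might look like is given below; the class name WindowMinSketch and both method bodies are assumptions, and the sketch deliberately returns the rightmost minimum, as winnowing in Chapter 3 requires.

```java
class WindowMinSketch {
    /** Returns the index of the rightmost minimal value in the window. */
    static int indexOfMin(int[] window) {
        int index = 0;
        for (int i = 1; i < window.length; i++) {
            // <= moves the index to the right on ties
            if (window[i] <= window[index]) index = i;
        }
        return index;
    }

    /** Returns the minimal value in the window. */
    static int getMinValue(int[] window) {
        return window[indexOfMin(window)];
    }
}
```

For the window {5, 1, 3, 1}, indexOfMin returns 3 and getMinValue returns 1.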


4.3.5. Compare Fingerprints

Figure 4.7.: Fingerprint comparison

This step requires the two sets of fingerprints of the files that should be compared. Figure 4.7 presents how the comparison of fingerprints is done. It is not necessary for the comparison that the sets have the same number of selected hashes. Each hash of one set is compared with each hash of the other set.
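The pairwise comparison can be sketched as follows. The class name CompareSketch and both method names are hypothetical. The second method is not part of the described implementation; it is included only to show a common alternative that uses a hash set instead of comparing every pair.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CompareSketch {
    /** Compares each hash of a with each hash of b, as described in the text. */
    static List<Integer> pairwiseMatches(List<Integer> a, List<Integer> b) {
        List<Integer> matches = new ArrayList<>();
        for (int x : a) {
            for (int y : b) {
                if (x == y) {
                    matches.add(x);
                    break; // report each hash of a at most once
                }
            }
        }
        return matches;
    }

    /** Alternative (an assumption, not the thesis's method): look up hashes of a in a set built from b. */
    static List<Integer> hashedMatches(List<Integer> a, List<Integer> b) {
        Set<Integer> lookup = new HashSet<>(b);
        List<Integer> matches = new ArrayList<>();
        for (int x : a) {
            if (lookup.contains(x)) matches.add(x);
        }
        return matches;
    }
}
```

Both methods return the same list; the second one avoids the quadratic number of comparisons for large fingerprint sets.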


4.3.6. Output

For every compared pair of files where at least one match occurred, an output is generated. The output is in the form of an HTML page, which contains the content of the compared files with the matches marked. As a result, the user can analyze found matches for source code or bytecode to decide if there is a plagiarism case. Figure 4.8 shows an output example of found matches in two source code files, where matches are marked red.

Figure 4.8.: Found matches example
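Marking a match in an HTML page essentially means wrapping the matching character range in a styled element. The sketch below shows this for a single range; the class name HtmlMarkSketch, the use of a span with an inline red color and the half-open range [start, end) are assumptions for illustration and say nothing about the markup the tool actually emits.

```java
class HtmlMarkSketch {
    /** Escapes the characters that have a special meaning in HTML text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps the characters in [start, end) of text in a red span inside a pre element. */
    static String mark(String text, int start, int end) {
        return "<pre>"
                + escape(text.substring(0, start))
                + "<span style=\"color:red\">"
                + escape(text.substring(start, end))
                + "</span>"
                + escape(text.substring(end))
                + "</pre>";
    }
}
```

The start and end positions stored with each k-gram provide exactly the range that such a function needs.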

4.4. Implementation Details

In the tokenization step, ANTLR (ANother Tool for Language Recognition) was used as a parser and lexer generator. A lexer and parser are used to remove irrelevant features from the files that should be compared. Therefore, a grammar was needed for the implementation language of the files because ANTLR generates the lexer and parser from the given grammar. The implemented copy-detection tool is limited to the comparison of Java files. The grammar for Java was provided by ANTLR [5]. We used ANTLR version 3 and Java version 1.8.0 for the implementation. As an external library, only the commons-io-2.5 library from Apache was used.


5. Experiments

This chapter reports our experience with the implemented copy-detection tool of Chapter 4 by testing the algorithm on 32 test files. First, four goals will be defined and then our test runs will be described. The results of our test runs will help us answer the research questions presented in Chapter 1.

5.1. Goals

The goals for our experiments are derived from the research questions shown in Chapter 1 and from the assumption that using the winnowing algorithm will imply that all relevant similarities will be found and that plagiarism between files is traceable in source code and bytecode. The following goals are defined for the experiments:

1. Find all similarities in the program code that are not smaller than the noise threshold k.

2. Find the same number of matches for source code and bytecode.

3. For each match detected in the bytecode a corresponding match can be found in source code and vice versa.

4. The measured density equals the expected density.

These goals are important for the evaluation of the test runs. For every test run, it will be described if these goals were satisfied. Then, we will have a look at the false-positive rate.

5.2. Test Set

For the test runs, 32 Java files are used. 21 of them were taken from the JPlag homepage [1]. JPlag is a copy-detection tool like MOSS but implemented with another algorithm than winnowing. The tool is widely used for plagiarism detection and therefore we decided to take the freely accessible test data provided on their homepage. These files only use the Java Standard Library, which is an important requirement of this implementation.

Nevertheless, the test set should be bigger. The JPlag homepage only provided us with 21 files. Therefore, we added eleven other test files, which were programmed by us or taken from the internet. The source of every test file is added as a comment in the corresponding file. We used only the Java Standard Library and placed similarities between the files we programmed ourselves. For the test runs, only the source code files are needed. The Java class files, which contain the bytecode, are generated by compiling the source code files. All similarities between the 32 test files are known to us, which makes it easy to analyze the HTML pages to see if all plagiarism cases were found. The test files are on average 5300 non-whitespace characters long.


5.3. Test Run

Overall, three test runs were made. We used different values for the parameters $w_{bc}$, $k_{bc}$, $w_{sc}$ and $k_{sc}$ in the test runs in order to find optimal values by evaluating whether the test results satisfied the defined goals. Every file of the test set was compared to all other files. So, 32 test files produce 496 comparisons each for bytecode and source code. A found match is returned if similarities between two files exist. Furthermore, our tool generates an HTML page for each found match in which the similarities are marked; these pages need to be analyzed by the user. In total there are 102 matches. In Chapter 4, an explanation was given why the k-gram length for bytecode and source code needs to be at least 25. We expect that most matches for values smaller than 25 are likely to be coincidental matches. Therefore, the first test run had the value 30 for $k_{sc}$. For bytecode comparison, strings of 200 characters were hashed in the first test run. The chosen parameters are based on the test results in the Appendix Results. They show that for $k_{bc} < 150$ too many matches are found and for $k_{sc} > 75$ too few matches are found.
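The number of comparisons follows from counting the unordered pairs of files:

```latex
\[
  \binom{32}{2} = \frac{32 \cdot 31}{2} = 496 .
\]
```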

5.3.1. First Test Run

For source code, strings of 30 characters were hashed and the window length was set to 20. For bytecode, the length of a k-gram was set to 200 and windows of 60 hashes were built. The measured density for both was near the expected density, with a small difference that was smaller for source code than for bytecode, as can be seen in Tables 5.1 and 5.2.

From 496 file comparisons, the algorithm found 206 matches for source code and 158 for bytecode. The analysis of the generated HTML pages showed that all plagiarism cases were found. Still, the false-positive rate was high. For the source code, about 104 found matches were irrelevant because the similarities were only typical Java language constructs like a try-catch block, which were nevertheless marked as a found match. The false-positive rate was smaller for bytecode, where about 60 found matches were irrelevant, however with 48 fewer found matches in total.

Table 5.1.: Source code (k=30, w=20)

hashes computed        76998
fingerprints selected  7330
measured density       0.09519
expected density       0.09523
matches found          206
false positives        104


Table 5.2.: Bytecode (k=200, w=60)

hashes computed        333981
fingerprints selected  10868
measured density       0.0325
expected density       0.0328
matches found          158
false positives        56
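The density rows of both tables can be recomputed from the other rows, assuming that the measured density is the number of selected fingerprints divided by the number of computed hashes and that the expected density is the value from Lemma 1:

```latex
\[
  \text{source code:}\quad \frac{7330}{76998} \approx 0.0952,
  \qquad \frac{2}{20+1} \approx 0.0952
\]
\[
  \text{bytecode:}\quad \frac{10868}{333981} \approx 0.0325,
  \qquad \frac{2}{60+1} \approx 0.0328
\]
```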

5.3.2. Second Test Run

Due to the high false-positive rate in the first test run, the parameters were adjusted, mostly to larger values. Bytecode comparison now used k-grams of 210 characters and windows of 55 hashes. For source code, strings of 40 characters were hashed and the window length was set to 23, as shown in Tables 5.3 and 5.4.

For both types of code, the difference between measured and expected density increased compared to the first test run. We now had a difference of 0.001 for source code and 0.0005 for bytecode. The number of found matches decreased for both: 175 matches were found for source code and 145 matches for bytecode. Similarly, the false-positive rate decreased. For source code, there were about 30 fewer false-positive matches and for bytecode about 10 fewer. But a high number of false-positive cases remained for source code and bytecode. However, all cases of plagiarism were found for both types of code.

Table 5.3.: Source code (k=40, w=23)

hashes computed        76678
fingerprints selected  6454
measured density       0.084
expected density       0.083
matches found          175
false positives        73

Table 5.4.: Bytecode (k=210, w=55)

hashes computed        333661
fingerprints selected  11751
measured density       0.0352
expected density       0.0357
matches found          145
false positives        43


5.3.3. Third Test Run

In the last test run, the parameters were changed once again due to the high false-positive rate in the previous two experiments: the k-gram and window length for source code were increased further, while the bytecode parameters were slightly reduced. $w_{bc} = 50$, $k_{bc} = 205$, $w_{sc} = 25$, and $k_{sc} = 50$ were chosen as parameters.

Still, a small difference between expected and measured density existed for both types of code, but this difference was only about 0.0002, both for source code and bytecode. Furthermore, the number of found matches decreased for source code, and the number of found matches for bytecode surpassed it: bytecode comparison found 150 matches, while the source code comparison found 122 matches.

Additionally, there were fewer false-positive cases for source code, due to the larger values for its parameters. However, not all plagiarism cases were found for source code. Three matches, which were found in the first two test runs, were not found in this one, even though they were real plagiarism cases and not only typical Java language constructs such as a try-catch block.

Table 5.5.: Source code (k=50, w=25)

hashes computed        76358
fingerprints selected  5857
measured density       0.0767
expected density       0.0769
matches found          122
false positives        23

Table 5.6.: Bytecode (k=205, w=50)

hashes computed        333821
fingerprints selected  13027
measured density       0.0390
expected density       0.0392
matches found          150
false positives        48

5.4. Evaluation

The three test runs are evaluated by assessing how well the defined goals were met. The first goal was met by all three test run results. All similarities in bytecode and source code were found when their length was bigger than the noise threshold k. Nevertheless, finding an appropriate k was a problem. For big k values, it is less likely that coincidental matches will be found. However, relevant matches shorter than this threshold are ignored. The influence of the parameter choices on the test results is very big. Choosing large or small values for the parameters w or k can lead to completely different results. That the results depend so much on the right parameter choices is a problem of the winnowing algorithm in general and not only of our implementation. Also, there is no guideline for the user to choose the right values. Therefore, the parameter choices depend mostly on the experience of the user as a programmer, which is a negative aspect because less experienced users would have difficulty finding the right values. For example, in the first two test runs, all relevant similarities were found, but after increasing $k_{sc}$ from 40 to 50, three relevant matches were not found. Additionally, the false-positive rate was high. It decreased with increased values for the parameters, but even so, many similarities were only Java language constructs.

Furthermore, the number of found matches for source code and bytecode was not equal in any test run, so the second goal was not met. In the first two experiments, there were more found matches for source code and in the third test run, more matches were found for bytecode. This is the result of using different parameters for source code and bytecode due to the different syntax and length of the instructions. Thus, the third goal was only partly reached. In the first two test runs, the matches in bytecode were also found in source code, and in the third experiment all detected similarities in source code had a corresponding found similarity in bytecode.

The last defined goal was achieved in every test run. The measured density was always very close to the expected one, with small differences that can be ignored. Another aspect of the implemented copy-detection tool is the runtime. A test run with 32 files, which includes 496 comparisons for bytecode and 496 for source code, takes about 45 seconds. Still, the runtime must be improved so that the copy-detection tool can be used for larger test sets without taking too much time. The improvement will be discussed in Chapter 6.

Finally, not all goals were fulfilled, but in the first two test runs all plagiarism cases were found and the measured density was always close to the predicted one. The problem of finding appropriate values for the different parameters needs to be solved so that the same matches are found for both types of code and their number is the same. A recommendation for the parameter choices in Java is $25 < k_{sc} < 50$, $15 < w_{sc} < 30$, $200 \le k_{bc}$, and $50 \le w_{bc}$. This recommendation is based on our experience with the test set presented in Section 5.2.


6. Conclusion and Future Work

The main goal was to implement a copy-detection tool based on the winnowing algorithm for source code and binary code and then to analyze and compare the performance of the implementation on the different types of code.

Using the winnowing algorithm guarantees that the desirable properties of noise suppression, insensitivity and position independence, presented in Chapter 2, are satisfied. Therefore, the expectation was to find all plagiarism attacks, except attacks where data types are changed or control structures are replaced with equivalent structures.

During the implementation process, the main goal was narrowed down to finding similarities in Java source code and Java bytecode with the implemented copy-detection tool. It was necessary to decide for which programming language the tool should detect plagiarism, since a language's grammar is needed in the tokenization step.

It would be possible to extend the set of programming languages for which the implementation is able to detect similarities. As of now, the set only includes Java. However, an extension would require considerable effort, because a language grammar would need to be defined and both binary code and source code would have to be analyzed in order to define rules for the tokenization step.

The analysis of the test results in Chapter 5 shows that the implementation finds every similarity longer than the defined noise threshold k in both types of code. However, the problem is to find appropriate values for the parameters w and k, so that the number of coincidental matches becomes as low as possible, all non-coincidental similarities are detected and the measured density equals the expected density.

Furthermore, the number of matches was not equal between source code and bytecode, which could be caused by choosing the wrong values for the parameters. Finally, the answer to the research questions presented in Chapter 1 is that all types of plagiarism can be detected with the implemented tool if the match is longer than the noise threshold k, except plagiarism attacks where data types are changed or where control structures are replaced with equivalent structures. In the experiments there was also a relation between the matches found in source code and in bytecode: most similarities detected in bytecode had a corresponding similarity in source code, and vice versa.

For future work, a few aspects remain open. First, the winnowing algorithm needs to be implemented more efficiently, as partly shown in Chapter 3. A solution is needed that decreases the runtime and memory consumption without losing any functionality. The main reason for the current runtime is an inefficient hash function. This could be improved by using the rolling hash function of the Rabin-Karp algorithm. Furthermore, the implementation always compares every hash in a window with every other hash instead of exploiting the fact that the minimum of the preceding window usually still lies within the current one, so that in most cases a single comparison with the newly added hash is sufficient to find the minimum.

The large memory consumption is caused by the user-friendly output in the form of an HTML page in which all matches are marked. For this kind of output, it is necessary to save all information during the tokenization of a file so that each k-gram can be mapped to the starting and ending position of the corresponding non-transformed strings.
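The two runtime improvements named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the class name WinnowingSketch, the constants BASE and MOD, and the use of a plain string as the token sequence are all hypothetical choices made for the example. The method rollingHashes derives each k-gram hash from its predecessor in constant time, in the style of the Karp-Rabin hash, and winnow rescans a window only when the previous minimum has left it.

```java
import java.util.ArrayList;
import java.util.List;

final class WinnowingSketch {
    // Hypothetical hash parameters, chosen so that all products fit into a long.
    private static final long BASE = 257L;
    private static final long MOD = 1_000_000_007L;

    /** Hashes every k-gram of text; each hash is derived from the previous one. */
    static long[] rollingHashes(String text, int k) {
        int n = text.length() - k + 1;
        if (n <= 0) {
            return new long[0];
        }
        long[] hashes = new long[n];
        long highPow = 1; // BASE^(k-1) mod MOD, weight of the leading character
        for (int i = 1; i < k; i++) {
            highPow = highPow * BASE % MOD;
        }
        long h = 0;
        for (int i = 0; i < k; i++) {
            h = (h * BASE + text.charAt(i)) % MOD;
        }
        hashes[0] = h;
        for (int i = 1; i < n; i++) {
            // Remove the character that leaves the k-gram, then append the new one.
            long drop = text.charAt(i - 1) * highPow % MOD;
            h = ((h - drop + MOD) % MOD * BASE + text.charAt(i + k - 1)) % MOD;
            hashes[i] = h;
        }
        return hashes;
    }

    /** Returns the positions of the selected hashes (rightmost minimum per window). */
    static List<Integer> winnow(long[] hashes, int w) {
        List<Integer> selected = new ArrayList<>();
        int minPos = -1; // position of the minimum of the previous window
        for (int right = w - 1; right < hashes.length; right++) {
            int left = right - w + 1;
            if (minPos < left) {
                // First window, or the old minimum slid out: rescan the whole window.
                minPos = left;
                for (int i = left + 1; i <= right; i++) {
                    if (hashes[i] <= hashes[minPos]) {
                        minPos = i;
                    }
                }
                selected.add(minPos);
            } else if (hashes[right] <= hashes[minPos]) {
                // Old minimum is still inside the window: one comparison suffices.
                minPos = right;
                selected.add(minPos);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        long[] hashes = rollingHashes("adorunrunrunadorunrun", 5);
        System.out.println(winnow(hashes, 4));
    }
}
```

The comparison uses "<=" so that ties are resolved in favor of the rightmost position, which is the same tie rule as in the indexOfMin function of Appendix A.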

Another possible improvement is the grouping of Java bytecode instructions, in which similar instructions are assigned to a group according to their functionality. For example, the instructions aload_0, aload_1, aload_2 and aload_3 are similar to aload, which loads a reference onto the stack from a local variable. These instructions differ only in whether the index of the local variable is given as an operand or is implied by the instruction itself. In this example, the instructions should be grouped, and all of them should be replaced by the group label in the class file. This improvement should increase the efficiency of the bytecode comparison.
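A possible shape of this grouping is sketched below. It assumes that the tokenizer already yields one instruction mnemonic per token with operands removed. The class name BytecodeGrouping and the choice to cover all short forms of the local-variable load and store instructions are assumptions for illustration, since the text above only names the aload family.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BytecodeGrouping {
    // Matches the short forms of the local-variable load/store instructions,
    // for example aload_0 .. aload_3 or istore_2. Group 1 is the family name.
    private static final Pattern SHORT_FORM =
            Pattern.compile("^([ilfda](?:load|store))_[0-3]$");

    /** Maps a mnemonic to its group label; unknown mnemonics are returned unchanged. */
    static String groupLabel(String mnemonic) {
        Matcher m = SHORT_FORM.matcher(mnemonic);
        return m.matches() ? m.group(1) : mnemonic;
    }

    /** Replaces every mnemonic in the sequence by its group label. */
    static List<String> normalize(List<String> mnemonics) {
        List<String> result = new ArrayList<>(mnemonics.size());
        for (String mnemonic : mnemonics) {
            result.add(groupLabel(mnemonic));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> method = Arrays.asList(
                "aload_0", "aload", "iload_1", "invokevirtual", "areturn");
        System.out.println(normalize(method));
    }
}
```

After this normalization, two methods that differ only in which local variable slot holds a reference produce the same token sequence and therefore the same k-gram hashes.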

Finally, by expanding the set of programming languages in which similarities can be detected, the implementation would no longer be restricted to Java files. Furthermore, the current implementation's restriction of Java files to the Java Standard Library needs to be removed.


Bibliography

[1] Institute for Program Structures and Data Organization. What is JPlag, July 2017. Retrieved: July 13, 2017 from https://jplag.ipd.kit.edu/.

[2] Jeong-Hoon Ji, Gyun Woo, and Hwan-Gue Cho. A plagiarism detection technique for Java program using bytecode analysis. In Third International Conference on Convergence and Hybrid Information Technology, volume 1, pages 1092–1098. IEEE, 2008.

[3] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, March 1987.

[4] Udi Manber. Finding similar files in a large file system. In USENIX Winter 1994 Technical Conference, pages 1–10, 1994.

[5] Terence Parr. The definitive ANTLR 4 reference. Pragmatic Bookshelf, 2013.

[6] Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. Winnowing: Local algorithms for document fingerprinting. In Proc. 2003 ACM SIGMOD International Conference on Management of Data, pages 76–85, San Diego, CA, June 2003.

[7] Xiaoming Yu, Yue Liu, and Hongbo Xu. Density analysis of winnowing on non-uniform distributions. In Advances in Data and Web Management, Joint 9th Asia-Pacific Web Conference, APWeb 2007, and 8th International Conference on Web-Age Information Management, WAIM 2007, Huang Shan, China, pages 586–593, 2007.


A. Functions

/**
 * Returns the index of the minimum value in an int array.
 * If the minimum occurs more than once, the index of its last
 * (rightmost) occurrence is returned.
 *
 * @param a non-empty array to search
 * @return index of the rightmost minimum value
 */
public static int indexOfMin(int[] a) {
    // Index of the current minimum value
    int loc = 0;
    // Current minimum value
    int min = a[0];
    // Iterate through the array
    for (int i = 1; i < a.length; i++) {
        // "<=" lets a later occurrence of the same value replace the saved one
        if (a[i] <= min) {
            min = a[i];
            loc = i;
        }
    }
    return loc;
}

/**
 * Returns the minimum value of an int array.
 *
 * @param numbers non-empty array to search
 * @return minimum value in the array
 */
public static int getMinValue(int[] numbers) {
    int minValue = numbers[0];
    for (int i = 1; i < numbers.length; i++) {
        if (numbers[i] <= minValue) {
            minValue = numbers[i];
        }
    }
    return minValue;
}

Figure A.1.: Functions for getting minimum value and corresponding index in an array


B. Results

B.1. Bytecode

Table B.1.: Bytecode (k=25, w=25)
  hashes computed          339581
  fingerprints selected     25861
  measured density         0.0761
  expected density         0.0769
  matches found               496
  false positives             394

Table B.2.: Bytecode (k=50, w=50)
  hashes computed          338781
  fingerprints selected     13046
  measured density         0.0385
  expected density         0.0392
  matches found               441
  false positives             339

Table B.3.: Bytecode (k=150, w=50)
  hashes computed          335581
  fingerprints selected     13084
  measured density         0.0390
  expected density         0.0392
  matches found               228
  false positives             126


B.2. Source Code

Table B.4.: Source code (k=150, w=150)
  hashes computed           73158
  fingerprints selected       928
  measured density         0.0127
  expected density         0.0132
  matches found                 6
  false positives               0

Table B.5.: Source code (k=100, w=100)
  hashes computed           74758
  fingerprints selected      1459
  measured density         0.0195
  expected density         0.0198
  matches found                14
  false positives               0

Table B.6.: Source code (k=75, w=75)
  hashes computed           75558
  fingerprints selected      1949
  measured density         0.0258
  expected density         0.0263
  matches found                21
  false positives               0
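The density rows of the tables can be recomputed from the other rows. The sketch below assumes that the measured density is the ratio of selected fingerprints to computed hashes and that the expected density is 2/(w+1), the value derived for winnowing in [6]. Both assumptions reproduce the values listed in Table B.6 up to rounding. The class and method names are hypothetical, and pairCount only restates the number of pairwise comparisons for the 32 test files.

```java
import java.util.Locale;

final class DensityCheck {
    /** Expected winnowing density for window size w, assuming d = 2 / (w + 1). */
    static double expectedDensity(int w) {
        return 2.0 / (w + 1);
    }

    /** Measured density as the fraction of computed hashes kept as fingerprints. */
    static double measuredDensity(long fingerprints, long hashes) {
        return (double) fingerprints / hashes;
    }

    /** Number of unordered file pairs among n files. */
    static int pairCount(int n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        // Values taken from Table B.6 (source code, k=75, w=75).
        System.out.printf(Locale.ROOT, "expected density: %.4f%n", expectedDensity(75));
        System.out.printf(Locale.ROOT, "measured density: %.4f%n",
                measuredDensity(1949, 75558));
        System.out.println("comparisons for 32 files: " + pairCount(32));
    }
}
```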
