Class NGramDetector

    • Field Summary

      Fields 
      Modifier and Type Field Description
      int minimum_window
      The minimum size of a list of N-Grams before checks begin.
      int ngram_size
      The character width of each N-Gram used in the detection.
      float threshold
      The threshold on the similarity value over which something is considered suspicious.
    • Constructor Summary

      Constructors 
      Constructor Description
      NGramDetector()
      Sets the detector's metadata and provides the API with references to the Worker and the Preprocessing Strategy.
    • Method Summary

      Modifier and Type Method Description
      float compare​(java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> string1, java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> string2)
      Compare 2 lists of N-grams and return a similarity metric
      void matchFound​(java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> reference, java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> check, uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram head, float last_peak, int since_last_peak, ISourceFile file1, ISourceFile file2)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • ngram_size

        @AdjustableParameter(name="N-Gram Size",
                             defaultValue=4.0f,
                             minimumBound=1.0f,
                             maxumumBound=10.0f,
                             step=1.0f,
                             description="The width in characters of each N-gram. Smaller is more sensitive.")
        public int ngram_size
        The character width of each N-Gram used in the detection.

        In theory, smaller values are more sensitive. In practice, values below 3 or above 8 are not recommended.
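
        To illustrate what a character N-gram of a given width looks like, here is a minimal sketch. It assumes N-grams are taken with a sliding window that advances one character at a time, which is consistent with the block-width formula in the Minimum Window description but is not confirmed by this page. NGramSketch and ngrams are hypothetical names and are not part of the detector's API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Sherlock API.
class NGramSketch {

    // Slides a window of `size` characters across `text`, one character at a time.
    // Returns an empty list when the text is shorter than one N-gram.
    static List<String> ngrams(String text, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= text.length(); i++) {
            out.add(text.substring(i, i + size));
        }
        return out;
    }

    public static void main(String[] args) {
        // With the default ngram_size of 4, an 8-character input yields 5 N-grams.
        System.out.println(ngrams("for(i=0;", 4));
    }
}
```

        A smaller width produces more, shorter N-grams from the same text, which is why smaller values tend to be more sensitive.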

      • minimum_window

        @AdjustableParameter(name="Minimum Window",
                             defaultValue=5.0f,
                             minimumBound=0.0f,
                             maxumumBound=20.0f,
                             step=1.0f,
                             description="The minimum number of N-grams that can be detected as a matched block. Character width of minimum block is N-gram size + minimum window - 1.")
        public int minimum_window
        The minimum size of a list of N-Grams before checks begin.

        While a match is in progress, the matched N-Grams are collected in a linked list. A minimum window size prevents very short runs of N-Grams (for example, a common construct such as a for loop) from being reported: if the match ends before the list reaches this size, nothing is flagged.
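
        The parameter description above gives the character width of the smallest reportable block as N-gram size + minimum window - 1. As a worked example of that arithmetic (WindowWidthSketch and minimumBlockWidth are hypothetical names used only for illustration):

```java
// Hypothetical helper for illustration only; not part of the Sherlock API.
class WindowWidthSketch {

    // Character width of the smallest block that can be flagged, using the
    // formula from the parameter description:
    // N-gram size + minimum window - 1.
    static int minimumBlockWidth(int ngramSize, int minimumWindow) {
        return ngramSize + minimumWindow - 1;
    }
}
```

        With the default values (ngram_size = 4, minimum_window = 5), the smallest block that can be flagged is 4 + 5 - 1 = 8 characters wide.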

      • threshold

        @AdjustableParameter(name="Threshold",
                             defaultValue=0.8f,
                             minimumBound=0.0f,
                             maxumumBound=1.0f,
                             step=0.001f,
                             description="The threshold on the similarity at which a block of code will be no longer considered similar. This determines where the similarity ends, 1 will give only pure matches, 0 will match anything")
        public float threshold
        The threshold on the similarity value over which something is considered suspicious.

        The 2 lists of N-Grams are compared to produce a similarity value between 0 and 1, with 1 being identical. This threshold decides at what point a segment is considered similar and, once the segment is long enough, treated as possible plagiarism.
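
        The interaction between the threshold and the minimum window might be sketched as follows. This is an illustration under stated assumptions, not the detector's actual logic: it assumes one similarity reading per N-gram added to the match, that the match ends at the first reading below the threshold, and that the comparison is inclusive (>=). ThresholdSketch, matchLength and isSuspicious are hypothetical names.

```java
// Hypothetical helper for illustration only; not part of the Sherlock API.
class ThresholdSketch {

    // Counts how many leading similarity readings stay at or above the threshold.
    // Assumption: the match ends at the first reading that falls below it.
    static int matchLength(float[] similarities, float threshold) {
        int n = 0;
        while (n < similarities.length && similarities[n] >= threshold) {
            n++;
        }
        return n;
    }

    // A match is treated as suspicious only if it lasted for at least
    // `minimumWindow` N-grams before the similarity dropped.
    static boolean isSuspicious(float[] similarities, float threshold, int minimumWindow) {
        return matchLength(similarities, threshold) >= minimumWindow;
    }
}
```

        Under these assumptions, a threshold of 1 only accepts readings of exactly 1 (pure matches), while a threshold of 0 accepts every reading, matching the behaviour stated in the parameter description.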

    • Constructor Detail

      • NGramDetector

        public NGramDetector()
        Sets the detector's metadata and provides the API with references to the Worker and the Preprocessing Strategy.
    • Method Detail

      • compare

        public float compare​(java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> string1,
                             java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> string2)
        Compare 2 lists of N-grams and return a similarity metric

        Finds the Jaccard similarity of the 2 lists of N-grams.

        Parameters:
        string1 - The reference N-gram list
        string2 - The check N-gram list
        Returns:
        The Jaccard similarity of the two lists, as a float
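
        For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below shows that definition on lists of strings. It is not the detector's implementation: the real method takes NGramDetector.Ngram objects, and this page does not say how duplicates or empty lists are handled. Here duplicates are collapsed into sets and two empty lists are treated as identical, both of which are assumptions. JaccardSketch and jaccard are hypothetical names.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Sherlock API.
class JaccardSketch {

    // Jaccard similarity: |A intersect B| / |A union B|, a value between 0 and 1.
    static float jaccard(List<String> reference, List<String> check) {
        Set<String> a = new HashSet<>(reference);
        Set<String> b = new HashSet<>(check);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0f; // convention chosen for this sketch only
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (float) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Two N-grams shared out of four distinct N-grams overall.
        System.out.println(jaccard(
                List.of("abcd", "bcde", "cdef"),
                List.of("bcde", "cdef", "defg")));
    }
}
```

        In this example the lists share 2 of 4 distinct N-grams, so the similarity is 0.5, which would fall below the default threshold of 0.8.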
      • matchFound

        public void matchFound​(java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> reference,
                               java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> check,
                               uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram head,
                               float last_peak,
                               int since_last_peak,
                               ISourceFile file1,
                               ISourceFile file2)