Class NGramDetector
- java.lang.Object
  - uk.ac.warwick.dcs.sherlock.api.model.detection.Detector<T>
    - uk.ac.warwick.dcs.sherlock.api.model.detection.PairwiseDetector<NGramDetector.NGramDetectorWorker>
      - uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector
All Implemented Interfaces:
IDetector<NGramDetector.NGramDetectorWorker>
public class NGramDetector extends PairwiseDetector<NGramDetector.NGramDetectorWorker>
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
- class NGramDetector.NGramDetectorWorker: The main processing method used in the detector.
Field Summary
Fields (Modifier and Type, Field, Description):
- int minimum_window: The minimum size of a list of N-Grams before checks begin.
- int ngram_size: The character width of each N-Gram used in the detection.
- float threshold: The threshold on the similarity value over which something is considered suspicious.
Constructor Summary
Constructors (Constructor, Description):
- NGramDetector(): Sets metadata for the detector and provides the API with pointers to the Worker and the Preprocessing Strategy.
Method Summary
Methods (Modifier and Type, Method, Description):
- float compare(java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> string1, java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> string2): Compare 2 lists of N-grams and return a similarity metric.
- void matchFound(java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> reference, java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> check, uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram head, float last_peak, int since_last_peak, ISourceFile file1, ISourceFile file2)
Methods inherited from class uk.ac.warwick.dcs.sherlock.api.model.detection.PairwiseDetector
buildWorkers, getAbstractPairwiseDetectorWorker
Methods inherited from class uk.ac.warwick.dcs.sherlock.api.model.detection.Detector
getDescription, getDisplayName, getPreProcessors, setDescription
Field Detail
ngram_size
@AdjustableParameter(name="N-Gram Size", defaultValue=4.0f, minimumBound=1.0f, maxumumBound=10.0f, step=1.0f, description="The width in characters of each N-gram. Smaller is more sensitive.")
public int ngram_size
The character width of each N-Gram used in the detection. In theory, smaller values are more sensitive, but in practice values below 3 or above 8 are not recommended.
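To illustrate what a character width of 4 means, the following is a minimal sketch of splitting text into overlapping character N-grams. The class NGramSplitSketch and its split method are hypothetical names chosen for this example. They are not part of the Sherlock API, and the detector's own tokenisation may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Sherlock API.
class NGramSplitSketch {

    // Slides a window of ngramSize characters over the text, one character at a time.
    static List<String> split(String text, int ngramSize) {
        if (ngramSize < 1) {
            throw new IllegalArgumentException("ngramSize must be at least 1");
        }
        List<String> ngrams = new ArrayList<>();
        for (int i = 0; i + ngramSize <= text.length(); i++) {
            ngrams.add(text.substring(i, i + ngramSize));
        }
        return ngrams;
    }

    public static void main(String[] args) {
        // With the default size of 4, the 7-character input yields 4 overlapping N-grams.
        System.out.println(split("for(int", 4)); // prints [for(, or(i, r(in, (int]
    }
}
```

A smaller width produces more, shorter N-grams, so short coincidental overlaps are more likely to match. This is one way to read "smaller is more sensitive".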
minimum_window
@AdjustableParameter(name="Minimum Window", defaultValue=5.0f, minimumBound=0.0f, maxumumBound=20.0f, step=1.0f, description="The minimum number of N-grams that can be detected as a matched block. Character width of minimum block is N-gram size + minimum window - 1.")
public int minimum_window
The minimum size of a list of N-Grams before checks begin. N-Grams are put into a linked list when being matched. To prevent a match being detected for a short run of N-Grams (e.g. picking up things like a for loop), a minimum window size is used. If the match ends before this size is reached, nothing is flagged.
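The parameter description gives the character width of the smallest block as N-gram size + minimum window - 1. The sketch below works through that arithmetic for the default values. The class and method names are hypothetical. The check that both values are at least 1 is an assumption of this example, since the formula is only meaningful for a window of one or more N-grams.

```java
// Hypothetical helper for illustration only; not part of the Sherlock API.
class MinimumBlockSketch {

    // Consecutive character N-grams overlap by all but one character,
    // so each additional N-gram extends the covered text by exactly one character.
    static int minimumBlockWidth(int ngramSize, int minimumWindow) {
        if (ngramSize < 1 || minimumWindow < 1) {
            // Assumption of this sketch: the formula is applied only to positive values.
            throw new IllegalArgumentException("both values must be at least 1");
        }
        return ngramSize + minimumWindow - 1;
    }

    public static void main(String[] args) {
        // Defaults from the annotations: N-gram size 4, minimum window 5.
        System.out.println(minimumBlockWidth(4, 5)); // prints 8
    }
}
```

With the defaults, a matched block therefore has to span at least 8 characters before it can be flagged.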
threshold
@AdjustableParameter(name="Threshold", defaultValue=0.8f, minimumBound=0.0f, maxumumBound=1.0f, step=0.001f, description="The threshold on the similarity at which a block of code will be no longer considered similar. This determines where the similarity ends, 1 will give only pure matches, 0 will match anything")
public float threshold
The threshold on the similarity value over which something is considered suspicious. The 2 lists of N-Grams are compared to produce a similarity value between 0 and 1, with 1 being identical. This threshold decides at what point a segment is considered similar and, once the segment is long enough, possible plagiarism.
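One possible reading of how the threshold and the minimum window interact is sketched below: a match keeps growing while the similarity stays at or above the threshold, and it is flagged only if it reached the minimum window before ending. Everything here is an assumption made for illustration, including the class ThresholdSketch, the array of per-step similarity values, and the inclusive comparison. The detector's actual worker logic is not documented on this page and may differ.

```java
// Hypothetical helper for illustration only; not part of the Sherlock API.
class ThresholdSketch {

    // Counts how many leading similarity values stay at or above the threshold.
    // Assumption: the comparison is inclusive (a value equal to the threshold still matches).
    static int matchLength(float[] similarities, float threshold) {
        int length = 0;
        for (float similarity : similarities) {
            if (similarity < threshold) {
                break; // the match ends at the first value below the threshold
            }
            length++;
        }
        return length;
    }

    // A match is flagged only if it reached the minimum window before ending.
    static boolean isFlagged(float[] similarities, float threshold, int minimumWindow) {
        return matchLength(similarities, threshold) >= minimumWindow;
    }

    public static void main(String[] args) {
        float[] similarities = {1.0f, 0.9f, 0.85f, 0.7f, 0.95f};
        System.out.println(matchLength(similarities, 0.8f));  // prints 3
        System.out.println(isFlagged(similarities, 0.8f, 5)); // prints false
    }
}
```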
Method Detail
compare
public float compare(java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> string1, java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> string2)
Compare 2 lists of N-grams and return a similarity metric. Finds the Jaccard Similarity of the 2 lists of N-grams.
Parameters:
string1 - The reference N-gram list
string2 - The check N-gram list
Returns:
The Jaccard Similarity as a float value
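As a rough illustration of the Jaccard similarity mentioned above, the sketch below treats each N-gram list as a set and divides the size of the intersection by the size of the union. The class JaccardSketch, the use of plain String values in place of NGramDetector.Ngram, and the result for two empty lists are assumptions made for this example. This is not the detector's actual implementation, which may treat duplicate N-grams differently.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Sherlock API.
class JaccardSketch {

    // Jaccard similarity: |A intersect B| / |A union B|, with each list treated as a set.
    static float jaccard(List<String> reference, List<String> check) {
        Set<String> union = new HashSet<>(reference);
        union.addAll(check);
        if (union.isEmpty()) {
            return 0.0f; // Assumption: two empty lists are reported as not similar.
        }
        Set<String> intersection = new HashSet<>(reference);
        intersection.retainAll(new HashSet<>(check));
        return (float) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        List<String> reference = List.of("abcd", "bcde", "cdef");
        List<String> check = List.of("abcd", "bcde", "cdeg");
        // Two shared N-grams out of four distinct ones.
        System.out.println(jaccard(reference, check)); // prints 0.5
    }
}
```

The result lies between 0 and 1, with 1 meaning the two sets of N-grams are identical, which matches the range described for the threshold field.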
matchFound
public void matchFound(java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> reference, java.util.ArrayList<uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram> check, uk.ac.warwick.dcs.sherlock.module.model.base.detection.NGramDetector.Ngram head, float last_peak, int since_last_peak, ISourceFile file1, ISourceFile file2)