CPD: Copy and Paste Detector: a tool for finding where source code has been duplicated/cloned.
There are three fundamental types of these tools:
- Text-based detectors. These match text strings or lines exactly and have essentially zero knowledge of the language being processed. They can be fast and scalable, but they find only exact copies: changes in formatting or added comments prevent detection of larger matches, so they give poor answers when the cloned code has been edited, which is the common case. Summary: cheap, easy, weak detection ability.
- Token-based detectors. These detectors know roughly how to break source code into its constituent atoms ("tokens") such as identifiers, numbers, keywords, operators, comments and whitespace. Knowing which tokens are whitespace and comments allows the detector to ignore them and so match code that has been reformatted. Ignoring the content of identifiers and numbers allows such detectors to match code where names have been changed or different values have been used. But these detectors don't understand language structure, and tend to report sequences such as "} {" as clones, even though such clones are uninteresting. As a consequence, token-based detectors have to match rather long sequences of tokens to avoid producing a lot of false positive matches. Summary: better, but requires very long matches to avoid a flood of false positives.
- Structure-based detectors. These know both the tokens and the language structure. Like token detectors, reformatting doesn't prevent matches. Unlike token detectors, these tools only identify clones that match language structures, such as expressions, statements, or blocks; they can never propose "} {". So they can find smaller clones reliably. They can also allow gaps between the identical parts that match structures, so they can recognize two identical statements separated by a third, differing statement as a single clone, with the third statement as a parameter. This allows detection of sophisticated clones. Summary: slower, but more accurate, and more interesting clones detected.
[Thanks to Semantic Designs for this background knowledge].
See http://en.wikipedia.org/wiki/Duplicate_code for more details.
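
The three views described above can be illustrated with a minimal Python sketch. It is not how CPD or any particular detector is implemented; the function names (`exact_lines`, `normalized_tokens`, `statement_shapes`) are hypothetical, and it assumes the code under inspection is Python so that the standard `tokenize` and `ast` modules can be used. It only shows what each kind of detector would compare, not how matching sequences are searched for.

```python
import ast
import io
import keyword
import tokenize

def exact_lines(src):
    """Text view: stripped non-blank lines, with no language knowledge."""
    return [line.strip() for line in src.splitlines() if line.strip()]

# Token kinds that carry only comments or layout (assumption: safe to drop here).
_LAYOUT = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
           tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}

def normalized_tokens(src):
    """Token view: drop comments and layout, abstract identifiers and numbers."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type in _LAYOUT:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")    # any identifier matches any other
        elif tok.type == tokenize.NUMBER:
            out.append("NUM")   # any numeric literal matches any other
        else:
            out.append(tok.string)
    return out

class _Abstract(ast.NodeTransformer):
    """Replace every variable name and constant with a placeholder."""
    def visit_Name(self, node):
        return ast.Name(id="ID", ctx=node.ctx)
    def visit_Constant(self, node):
        return ast.Constant(value=0)

def statement_shapes(src):
    """Structure view: one comparable shape per top-level statement."""
    tree = _Abstract().visit(ast.parse(src))
    return [ast.dump(stmt) for stmt in tree.body]

a = "total = price * 3  # triple\n"
b = "sum_ = cost*7\n"

print(exact_lines(a) == exact_lines(b))              # text view
print(normalized_tokens(a) == normalized_tokens(b))  # token view
print(statement_shapes(a) == statement_shapes(b))    # structure view
```

The text view treats the two snippets as different because of spacing, naming and the comment, while the token and structure views treat them as the same. The structure view differs from the token view in that its units are whole statements, so it has no way to report a fragment that straddles two constructs.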