Detecting duplicate code in PHP files

Duplicated code is common in most projects, and it is a prime candidate for factoring out into a new class or function. Copy-and-paste coding is a widespread habit among programmers, and too much of it inflates code size and creates maintenance nightmares. PHPCPD (PHP Copy/Paste Detector) is a PEAR tool that makes it easier to detect duplicate code in PHP projects. Below is a short tutorial on the PHPCPD package.

1. Installing phpcpd
We will use the PEAR installer for this. First, the PEAR channel that distributes phpcpd needs to be registered with the local PEAR environment. This tells PEAR where to download the install files from.

c:\> pear channel-discover pear.phpunit.de
Adding Channel "pear.phpunit.de" succeeded
Discovery of channel "pear.phpunit.de" succeeded

After this is done the PEAR installer is ready to install phpcpd.

c:\> pear install phpunit/phpcpd
downloading phpcpd-1.1.1.tgz ...
Starting to download phpcpd-1.1.1.tgz (8,078 bytes)
.....done: 8,078 bytes
install ok: channel://pear.phpunit.de/phpcpd-1.1.1

After installation, you will find the PHPCPD source files in the PEAR directory.

2. Running your first check
You can check for duplicate code in an individual file or in a whole directory. Here is our first check, run on the admin subdirectory of a project, along with its results.

c:\localhost\project> phpcpd ./admin
phpcpd 1.1.1 by Sebastian Bergmann.
 
Found 3 exact clones with 50 duplicated lines in 6 files:
 
  - .\messages.php:95-105
    .\messagesgroup.php:112-122

  - .\viewschedules.php:14-23
    .\tutorbookings.php:14-23

  - .\ampieexport.php:4-35
    .\amcolumnexport.php:4-35
 
0.35% duplicated lines out of 14456 total lines of code.
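
The percentage on the last line of the report appears to be simply the duplicated lines divided by the total lines. As a quick sanity check, here is a minimal Python sketch of that arithmetic. The function name is hypothetical, and the formula is inferred from the figures in the report rather than taken from the phpcpd source.

```python
def duplication_percentage(duplicated_lines: int, total_lines: int) -> float:
    """Return duplicated lines as a percentage of all lines, rounded to 2 places.

    Assumption: this mirrors the figure phpcpd prints; the formula is
    inferred from the report output, not from phpcpd itself.
    """
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return round(100.0 * duplicated_lines / total_lines, 2)


# Figures taken from the report above: 50 duplicated lines out of 14456.
print(duplication_percentage(50, 14456))  # prints 0.35
```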

3. Options
By default phpcpd looks for clones of at least 5 identical lines and 70 identical tokens. Duplicates with fewer than 5 lines or fewer than 70 tokens are ignored. To override these thresholds, use the --min-lines and --min-tokens switches as below.

c:\localhost\project> phpcpd --min-lines 4 --min-tokens 40 ./admin
phpcpd 1.1.1 by Sebastian Bergmann.
 
Found 9 exact clones with 187 duplicated lines in 14 files:
 
  - .\action\action.updatestudent.php:15-27
    .\action\action.updatetutor.php:15-27

  - .\adminFunctions.php:91-98
    .\adminFunctions.php:124-131

  - .\messages.php:95-118
    .\messagesgroup.php:112-135

  - .\viewschedules.php:14-84
    .\tutorbookings.php:14-84

  - .\viewschedules.php:167-185
    .\tutorbookings.php:145-163

  - .\tutors.php:14-20
    .\dimdim.php:14-20

  - .\ampieexport.php:4-44
    .\amcolumnexport.php:4-44

  - .\geoip\geoip.inc.php:236-241
    .\geoip\geoip.inc.php:272-277

  - .\studentschedule.php:14-20
    .\Copy of onetomanyschedule.php:14-20
 
1.29% duplicated lines out of 14457 total lines of code.
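
To see why a lower line threshold surfaces more clones, here is a deliberately simplified Python sketch. It is not phpcpd's algorithm: phpcpd compares token streams, whereas this toy version only compares windows of whitespace-stripped lines. The function name and the sample file contents are hypothetical.

```python
from collections import defaultdict


def find_line_clones(files, min_lines=5):
    """Return groups of (file name, 1-based start line) whose next
    `min_lines` lines are identical after stripping whitespace.

    Simplified illustration only; phpcpd itself works on tokens.
    """
    seen = defaultdict(list)
    for name, lines in files.items():
        stripped = [line.strip() for line in lines]
        for start in range(len(stripped) - min_lines + 1):
            window = tuple(stripped[start:start + min_lines])
            if not any(window):  # ignore windows made only of blank lines
                continue
            seen[window].append((name, start + 1))
    return [locations for locations in seen.values() if len(locations) > 1]


# Hypothetical file contents: b.php repeats the four lines of a.php.
files = {
    "a.php": ["$a = 1;", "$b = 2;", "$c = $a + $b;", "echo $c;"],
    "b.php": ["// header", "$a = 1;", "$b = 2;", "$c = $a + $b;", "echo $c;"],
}

print(find_line_clones(files, min_lines=5))  # prints []
print(find_line_clones(files, min_lines=4))  # prints [[('a.php', 1), ('b.php', 2)]]
```

The four-line clone is invisible at the default threshold of 5 and only shows up once the threshold drops to 4, which is the same effect seen in the two reports above.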

The report generated by phpcpd can also be exported in the PMD-CPD XML format. The following command scans the admin directory and saves the report to the file projectPhpcpd.xml.

c:\localhost\project> phpcpd --log-pmd projectPhpcpd.xml ./admin
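
Once the report is in XML form, other tools can consume it. The sketch below reads a PMD-CPD style report with Python's standard library. It assumes the usual PMD-CPD layout, with a duplication element per clone that carries lines and tokens attributes and contains file elements with path and line attributes. The sample XML and the token count in it are hypothetical, so check them against a report from your own run.

```python
import xml.etree.ElementTree as ET


def summarize_cpd_report(xml_text):
    """Return one (lines, tokens, [(path, start_line), ...]) tuple per clone.

    Assumes the PMD-CPD layout described above; adjust the element and
    attribute names if your report differs.
    """
    root = ET.fromstring(xml_text)
    clones = []
    for dup in root.iter("duplication"):
        locations = [(f.get("path"), int(f.get("line"))) for f in dup.findall("file")]
        clones.append((int(dup.get("lines")), int(dup.get("tokens")), locations))
    return clones


# Hypothetical sample modelled on the first clone in the report above.
SAMPLE = """<pmd-cpd>
  <duplication lines="11" tokens="75">
    <file path="messages.php" line="95"/>
    <file path="messagesgroup.php" line="112"/>
    <codefragment>...</codefragment>
  </duplication>
</pmd-cpd>"""

for lines, tokens, locations in summarize_cpd_report(SAMPLE):
    print(lines, tokens, locations)
```

For a real report, read the contents of projectPhpcpd.xml into a string and pass it to the function instead of SAMPLE.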

Most PHP source files have the .php extension, so by default phpcpd only scans files with that suffix. To add other extensions, use the --suffixes option, which takes a comma-separated list of extension names.

c:\localhost\project> phpcpd --suffixes php,php5 ./admin

Concluding thoughts
There is also a Java program called PMD that can detect duplicate code. The main advantage of a PEAR package, however, is that you can integrate it into the project itself or use it with phpUnderControl.



