Detect Similar Documents Using PHP

Identical or similar content can be detected with an algorithm known as document distance. It is a practical basis for applications that look for copyright infringement, duplicate content, or similar content.

Explanations of the algorithm are easy to find by searching on Google, so this post does not go into its details.
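For orientation only, the quantity that the code computes can be summarized in one formula. This is a sketch that assumes $D_1$ and $D_2$ denote the word-frequency vectors of the two documents (notation introduced here, not taken from the code):

$$
d(D_1, D_2) = \arccos\left( \frac{D_1 \cdot D_2}{\sqrt{(D_1 \cdot D_1)\,(D_2 \cdot D_2)}} \right)
$$

A distance of 0 means the two vectors point in the same direction, and $\pi/2 \approx 1.57$ means the documents share no words.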

Given below is a PHP implementation with sample text files. You are free to use or adapt the code for your own needs. Any pointers, queries and constructive criticism are welcome.

<?php
 // Find Document Distance: Given two documents, how similar are they
 //
 // Copyright (C) 2015  Amit Sengupta, amit@truelogic.org
 //
 // This program is free software: you can redistribute it and/or modify
 // it under the terms of the GNU General Public License as published by
 // the Free Software Foundation, either version 3 of the License, or
 // (at your option) any later version.
 //
 // This program is distributed in the hope that it will be useful,
 // but WITHOUT ANY WARRANTY; without even the implied warranty of
 // MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 // GNU General Public License for more details.
 //
 // You should have received a copy of the GNU General Public License
 // along with this program.  If not, see <http://www.gnu.org/licenses/>.
 // ..........................................................................
 //
 //
 // Reference: http://courses.csail.mit.edu/6.006/spring11/lectures/lec01.pdf
 // Algorithm:
 //	1.Read each document
 //	2.Split each document into words. A valid word is at least 3 characters long.
 //	3.Count word frequencies (document vectors) for each document
 //	4.Compute dot product for doc 1 and 2
 //	5.Compute dot product for doc 1 and 1
 //	6.Compute dot product for doc 2 and 2
 //	7.Compute document distance. 0 = identical, 1.57 (pi/2) = completely different
 //
    error_reporting(E_ALL ^ E_WARNING ^ E_NOTICE ^ E_DEPRECATED);

    

    // Compare file1.txt with each of the other sample files.
    // __DIR__ is the directory of this script, so the paths resolve
    // regardless of the path separator the operating system uses.
    $comparisons = array(
        "file2.txt" => "file1.txt and file2.txt are identical",
        "file3.txt" => "file1.txt and file3.txt are same but file3 has some lines removed",
        "file4.txt" => "file1.txt and file4.txt are same but file4 has extra content added on the same topic",
        "file5.txt" => "file1.txt and file5.txt are same but file5 has content of file1 changed by an article spinner app.",
        "file6.txt" => "file1.txt and file6.txt are same but file6 has content of file1 changed by a second article spinner app."
    );

    foreach ($comparisons as $otherFile => $heading) {
        echo("<h3>" . $heading . "</h3>");
        findDistance(__DIR__ . "/file1.txt", __DIR__ . "/" . $otherFile);
    }
    
    
    

    ///////////////////////////////////////////////////////////////////////////////////

    /**
     * Find distance between two files
     * @param string $file1 file 1
     * @param string $file2 file 2
     */
    function findDistance($file1, $file2) {

	// handle file 1
	$data1 = readData($file1);
	$arrWordCount1 = processFile($data1);
    
	// handle file 2
	$data2 = readData($file2);
	$arrWordCount2 = processFile($data2);


	// compute inner product for 1 and 2
	$arrInnerProduct = getInnerProduct($arrWordCount1, $arrWordCount2);
	$innerProductForBoth  = $arrInnerProduct["INNERPRODUCT"];
	$arrFreq = $arrInnerProduct["FREQUENCIES"];
	echo("Inner Product for 1 and 2 =" . $innerProductForBoth . "<br>");

	// compute inner product for 1 and 1
	$arrInnerProductFor1 = getInnerProduct($arrWordCount1, $arrWordCount1);
	$innerProductFor1  = $arrInnerProductFor1["INNERPRODUCT"];
	$arrFreq1 = $arrInnerProductFor1["FREQUENCIES"];
	echo("Inner Product for 1 and 1 =" . $innerProductFor1 . "<br>");


	// compute inner product for 2 and 2
	$arrInnerProductFor2 = getInnerProduct($arrWordCount2, $arrWordCount2);
	$innerProductFor2  = $arrInnerProductFor2["INNERPRODUCT"];
	$arrFreq2 = $arrInnerProductFor2["FREQUENCIES"];
	echo("Inner Product for 2 and 2 =" . $innerProductFor2 . "<br>");



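	// compute document distance
	$numerator = $innerProductForBoth;
	$denominator = sqrt($innerProductFor1 * $innerProductFor2);

	if ($denominator == 0) {
	    // Assumption: a document with no valid words is treated as completely different
	    $distance = M_PI_2;
	} else {
	    // Important: acos returns NaN if value is > 1 (possible through rounding),
	    // so clamp the ratio to the range -1..1 first
	    $distance = acos(max(-1, min(1, $numerator / $denominator)));
	}

	// express the distance as a percentage of the maximum distance (pi/2, about 1.57)
	$percent = 100 - ($distance * 100) / M_PI_2;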

	echo("<br>Document distance =" . $distance . "<br>");
	echo("Probability of documents being same: " . number_format($percent,2) . "%<br>");
    }
    

    /**
     * Read text from file and return it
     * @param string $path full path of file to read
     * @return $text string file data
     */
    function readData($path) {
	// file_get_contents also copes with empty files, where fread() with a zero length fails
	$text = file_get_contents($path);
	if ($text === false)
	    exit("Error reading " . $path);

	return $text;
    }
    

    /**
     * Parse file data into words, count their frequencies and return array of wordcount
     * @param string $data file data
     * @return array $arrWordCount frequency count of each word
     */
    function processFile($data) {
	$words = array();

	// first convert all non-alphabet and non-number characters into spaces.
	$data = preg_replace('/[^0-9a-zA-Z]/i',' ', $data);

	//parse words from text
	$count = preg_match_all('/(\S{3,})/i', $data, $match_array);
	if ($count > 0) 
	    $words = $match_array[0];

	// get count of each word in array
	$arrWordCount = array_count_values($words);

	return $arrWordCount;
    } 


    /**
     * Process two word count arrays and compute dot product
     * @param array $arr1 array of word count
     * @param array $arr2 array of word count
     * @return array array of frequencies and dot product
     */
    function getInnerProduct($arr1, $arr2) {
	$count1 = 0;
	$count2 = 0;
	$finalTotal = 0;
	$dot = array();
	
	foreach($arr1 as $key=>$value) {
	    $count1 = $value;
	    if (array_key_exists($key, $arr2)) {
		$count2 = $arr2[$key];
		$total = $count1 * $count2;
		$finalTotal += $total;
		$dot[$key] = $total;
	    }
	}


	return array("INNERPRODUCT"=>$finalTotal, "FREQUENCIES"=>$dot);
    } 


?> 
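To try the idea without the sample files, the same steps can be applied to two short strings. The following is a minimal sketch, not a replacement for the script above: the function names `documentDistance`, `wordCounts` and `dotProduct` are hypothetical helpers introduced here, and treating a text with no valid words as completely different is an assumption.

```php
<?php
// Hypothetical helper (not part of the script above): angle in radians
// between the word-frequency vectors of two strings.
function documentDistance(string $textA, string $textB): float {
    $countsA = wordCounts($textA);
    $countsB = wordCounts($textB);

    $dotAB = dotProduct($countsA, $countsB);
    $dotAA = dotProduct($countsA, $countsA);
    $dotBB = dotProduct($countsB, $countsB);

    // Assumption: a text with no valid words is treated as completely different.
    if ($dotAA == 0 || $dotBB == 0) {
        return M_PI_2;
    }

    // Clamp so rounding error cannot push the ratio outside acos()'s domain.
    $cosine = $dotAB / sqrt($dotAA * $dotBB);
    return acos(max(-1.0, min(1.0, $cosine)));
}

// Same word rule as the script: runs of at least 3 letters or digits.
function wordCounts(string $text): array {
    preg_match_all('/[0-9a-zA-Z]{3,}/', $text, $matches);
    return array_count_values($matches[0]);
}

// Sum of count products over the words that appear in both arrays.
function dotProduct(array $a, array $b): float {
    $sum = 0;
    foreach ($a as $word => $count) {
        $sum += $count * ($b[$word] ?? 0);
    }
    return (float) $sum;
}

// Three of the four words are shared, so the cosine is 3/4.
echo documentDistance("the quick brown fox", "the quick brown dog") . "\n";
```

With these inputs each text has four words of which three are shared, so the cosine is 3/4 and the printed distance is about 0.7227. Two identical strings give 0, and two strings with no words in common give pi/2.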

The output of the code is shown below:

[Screenshot of the program output: 2015-04-26 13-18-56]

The sample files used are given below:

file1 file2 file3 file4 file5 file6
