Split large xml into parts before processing in PHP

  • Sumo

PHP has very powerful xml functions and libraries in the form of DOMDocument, XPath, SimpleXML and a few others. But these tree-based APIs share a scalability problem: they need to load the entire xml document into memory before any processing can be done.

This is fine as long as the xml files are a few hundred KB in size, but once you start dealing with large xml files of 2 MB, 8 MB or more, you will find that the documents just won’t load unless PHP has a large amount of memory at its disposal. Even with 250 MB of memory available to PHP, loading a 5 MB document can take a very long time, if it loads at all. You can’t have that kind of bottleneck in production applications.

To solve this problem, we can break the original xml document into several smaller xml documents and then load each of those files for processing. This approach is not only much faster than trying to process the original document but also keeps server load low.

SPLITTING LOGIC

The whole concept of splitting an xml file is based on the premise that you have a repeating tag which forms the boundary on which the split takes place. The original xml file is split into multiple files, which are written to the server (the function below writes them to PHP’s current working directory), and an array of the generated filenames is returned. This logic also assumes that each xml tag line ends with a newline character, which is typical of pretty-printed xml files.

The following terms are to be understood:

Boundary Tag – The xml tag at which the splitting will occur

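Start At – Each generated file is named with a number, e.g. 1.xml, 2.xml, 3.xml. This value determines where the numbering starts. The function below increments its counter before writing each file, so the first file is numbered Start At + 1; pass 0 to get 1.xml. Generally it would be set to zero or 1.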

Max Items – The number of occurrences of the boundary tag after which the split is made, e.g. 100 or 500.

Fixed Footer – This is explained in detail below.

An example of a split is given below:

The data appearing before the first occurrence of the boundary tag is the header part. The data appearing after the last occurrence of the boundary tag is the footer part.

So every split file that is created must contain the header part, then the extracted boundary tag data, and then the footer at the end, so that it is a valid xml file with the same structure as the original.

So for the above example with 500 product tags, with max items set at 100, it will create 5 files of 100 products each.
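To make the header, boundary tag data and footer parts concrete, here is a minimal sketch on a made-up four-product document. The sample xml, the `products` and `product` tag names and all variable names are assumptions for illustration only; this is not the function from this article, just the same line-by-line idea on a tiny scale.

```php
<?php
// Hypothetical sample: "product" is the boundary tag, one item per line.
$sample = implode("\n", array(
    '<?xml version="1.0"?>',
    '<products>',
    '<product><id>1</id></product>',
    '<product><id>2</id></product>',
    '<product><id>3</id></product>',
    '<product><id>4</id></product>',
    '</products>',
));

$lines = explode("\n", $sample);

// Find the first opening and the last closing boundary tag.
$first = null;
$last  = null;
foreach ($lines as $i => $line) {
    if ($first === null && strpos($line, '<product>') !== false) {
        $first = $i;
    }
    if (strpos($line, '</product>') !== false) {
        $last = $i;
    }
}

$header = array_slice($lines, 0, $first);                  // before the first boundary tag
$items  = array_slice($lines, $first, $last - $first + 1); // the repeating items
$footer = array_slice($lines, $last + 1);                  // after the last boundary tag

// One item per line in this sample, so the line count is the item count.
$maxItems  = 2;
$fileCount = (int) ceil(count($items) / $maxItems);

echo "header lines: " . count($header) . "\n";
echo "items:        " . count($items) . "\n";
echo "footer lines: " . count($footer) . "\n";
echo "files needed: " . $fileCount . "\n";
```

The same arithmetic gives ceil(500 / 100) = 5 files for 500 products with max items set at 100.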

The entire logic has been made into a PHP function which can be used anywhere.

/**
 * Function to break an xml file into several smaller files.
 * If the original xml file holds no more than $maxItems items, a single file is created.
 * Files are written to the current working directory and named <number>.xml.
 * @param string $boundaryTag name of the repeating tag to split on
 * @param int $startAt file counter start; the first file is numbered $startAt + 1
 * @param int $maxItems maximum occurrences of the boundary tag per file
 * @param string $rawdata the raw data from the original xml file
 * @param string|null $fixedFooter if not empty, used as the footer instead of the computed one
 * @return array filenames created
 **/
function breakIntoFiles($boundaryTag, $startAt, $maxItems, $rawdata, $fixedFooter = null) {

	$tag = trim($boundaryTag);
	$maxItems = max(1, (int) $maxItems); // guard against zero or negative values
	$arr = explode("\n", $rawdata);
	$length = count($arr);
	$items = 0;           // items in the current chunk; reset whenever a file is written
	$files = $startAt;    // file counter; incremented before each file is written
	$header = "";         // header block for xml file
	$footer = "";         // footer block for xml file
	$chunk = "";          // chunk of xml data to be written into file
	$arrFiles = array();  // array of files created
	$boundaryIsFound = false; // true once the first boundary tag is found

	// get footer data: every line after the last closing boundary tag
	$footerBreak = "</" . $tag . ">";
	$footerStart = $length; // index of the first footer line
	for ($i = $length - 1; $i >= 0; $i--) {
		// strict comparison: strpos() returns 0 when the tag starts the line
		if (strpos($arr[$i], $footerBreak) !== false)
			break;
		$footer = rtrim($arr[$i], "\r") . "\r\n" . $footer;
		$footerStart = $i;
	}

	// a fixed footer replaces the computed one in every file, including the last;
	// if no boundary tag exists at all, the document is written out unchanged
	if ($footerStart > 0 && $fixedFooter !== null && $fixedFooter !== '')
		$footer = $fixedFooter;

	// process main data (everything before the footer)
	for ($i = 0; $i < $footerStart; $i++) {
		$line = rtrim($arr[$i], "\r");

		if (strpos($line, "<" . $tag . ">") !== false ||
			strpos($line, "<" . $tag . " ") !== false) {
			// the current chunk is full: write it out before starting the next item
			if ($items >= $maxItems) {
				$files++;
				$filename = $files . ".xml";
				file_put_contents($filename, $header . $chunk . $footer);
				$arrFiles[] = $filename;
				$chunk = "";
				$items = 0;
			}
			$items++;
			$boundaryIsFound = true;
		}

		if ($boundaryIsFound)
			$chunk .= $line . "\r\n";
		else
			$header .= $line . "\r\n";
	}

	// write whatever is left as the last file
	if ($chunk !== "" || count($arrFiles) == 0) {
		$files++;
		$filename = $files . ".xml";
		file_put_contents($filename, $header . $chunk . $footer);
		$arrFiles[] = $filename;
	}

	return $arrFiles;
}
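A usage sketch may help here, since the function expects the raw xml text as a string, not a filename or a DOMDocument object. The wrapper name `splitAndProcess`, its parameters and the callback are hypothetical. The sketch assumes that `breakIntoFiles()` from this article is already defined, that the SimpleXML extension is available, and that the boundary tag is a direct child of the root element. Because the function writes to the current working directory, the sketch switches to the output directory before calling it.

```php
<?php
/**
 * Hypothetical wrapper: split a large xml file, then process each part.
 * Assumes breakIntoFiles() from this article has already been loaded.
 */
function splitAndProcess($xmlPath, $outputDir, $boundaryTag, $maxItems, $onItem) {
    // Read the whole file as a string; this is what $rawdata expects.
    $rawdata = file_get_contents($xmlPath);
    if ($rawdata === false) {
        throw new RuntimeException("Cannot read " . $xmlPath);
    }

    $previousDir = getcwd();
    if (!chdir($outputDir)) {
        throw new RuntimeException("Cannot enter " . $outputDir);
    }

    $count = 0;
    try {
        // Part files land in $outputDir because it is now the working directory.
        $parts = breakIntoFiles($boundaryTag, 0, $maxItems, $rawdata, null);

        foreach ($parts as $part) {
            // Each part is small enough to load comfortably.
            $xml = simplexml_load_file($part);
            if ($xml === false) {
                continue; // skip a part that failed to parse
            }
            // Assumption: the boundary tag sits directly under the root element.
            foreach ($xml->{$boundaryTag} as $item) {
                $onItem($item);
                $count++;
            }
        }
    } finally {
        chdir($previousDir); // always restore the working directory
    }

    return $count; // number of items handed to the callback
}
```

A call might look like `splitAndProcess('verybig.xml', '/tmp/parts', 'product', 100, function ($p) { echo $p->id, "\n"; });`, where the paths and the `id` child element are placeholders.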

EXPLANATION OF THE FIXED FOOTER PARAMETER

By default, whatever data appears after the last occurrence of the boundary tag is added as the footer of each generated xml file. Sometimes you may not want this behavior and would rather supply your own footer. This custom data can be passed in as a string in the fixedFooter argument. In that case it is your responsibility to ensure that the resulting xml file is valid, with every open tag closed at the end.
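Since a custom footer can easily leave a tag unclosed, it may be worth checking one generated part before processing the rest. The helper below is a sketch and not part of the original function: `isWellFormedXml` is a hypothetical name, the header, chunk and footers are made up, and the check relies on `simplexml_load_string()`, so it assumes the SimpleXML extension is available. It only tests well-formedness, not validity against a schema.

```php
<?php
/**
 * Hypothetical helper: true if $xml parses as a well-formed document.
 */
function isWellFormedXml($xml) {
    // Collect parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    // Strict check: an empty element is falsy but still a successful parse.
    return $doc !== false;
}

// Made-up header, chunk and footers for illustration.
$header = "<?xml version=\"1.0\"?>\r\n<catalog>\r\n<products>\r\n";
$chunk  = "<product><id>1</id></product>\r\n";

$goodFooter = "</products>\r\n</catalog>\r\n"; // closes both open tags
$badFooter  = "</products>\r\n";               // leaves the catalog tag open

var_dump(isWellFormedXml($header . $chunk . $goodFooter)); // bool(true)
var_dump(isWellFormedXml($header . $chunk . $badFooter));  // bool(false)
```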

6 Comments on Split large xml into parts before processing in PHP

  1. Hello, thank you for the code, can you give an example how to apply this?
    i tried with my 15mb xml file but i’ve got ERROR
    Warning: explode() expects parameter 2 to be string, object given in

    i tried:

    $boundaryTag="row";
    $fixedFooter="csv_data";
    $dom = new DOMDocument();
    $dom->loadXML(file_get_contents("file-ready.xml"));
    breakIntoFiles($boundaryTag, 1, 5000, $dom, $fixedFooter)

    structure of xml is:

    Please advise

  2. Hello,
    Thank you for the code. If I think well, then:

    $boundaryTag=”;
    $startAt=1;
    $maxItems=2000;
    $myxmlfile=”../xml/verybig.xml”;
    while(!feof($myxmlfile)) {
    $rowdata=fgets($myxmlfile);
    breakIntoFiles($boundaryTag, $startAt, $maxItems, $rawdata, $fixedFooter);
    }

    Unfortunately it’s not working.

  3. @Lee Actually the usage is wrong. You are calling the function after reading each row of the xml file. The correct way is to load all the data from the xml file and pass it to the function.
