{"id":1403,"date":"2012-07-03T12:02:09","date_gmt":"2012-07-03T12:02:09","guid":{"rendered":"http:\/\/truelogic.org\/wordpress\/?p=1403"},"modified":"2012-07-03T12:02:09","modified_gmt":"2012-07-03T12:02:09","slug":"split-large-xml-into-parts-before-processing-in-php","status":"publish","type":"post","link":"https:\/\/truelogic.org\/wordpress\/2012\/07\/03\/split-large-xml-into-parts-before-processing-in-php\/","title":{"rendered":"Split large xml into parts before processing in PHP"},"content":{"rendered":"            <script type=\"text\/javascript\" src=\"https:\/\/truelogic.org\/wordpress\/wp-content\/plugins\/wordpress-code-snippet\/scripts\/shBrushPhp.js\"><\/script>\n<p><a href=\"https:\/\/truelogic.org\/wordpress\/2012\/07\/03\/split-large-xml-into-parts-before-processing-in-php\/cake\/\" rel=\"attachment wp-att-1419\"><img loading=\"lazy\" decoding=\"async\" class=\"alignleft size-full wp-image-1419\" title=\"cake\" src=\"https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2012\/07\/cake.jpeg\" alt=\"\" width=\"259\" height=\"194\" \/><\/a>PHP has very powerful xml functions and libraries in the form of DOMDocument, XPath, SimpleXML and a few others. But the three named here share a scalability problem: they need to load the entire xml document into memory before any processing can be done.<\/p>\n<p>This is fine as long as the xml files are a few hundred KB in size, but once you start dealing with large xml files of 2 MB, 8 MB or more, you will find that the documents just won&#8217;t load unless you have a large amount of memory at PHP&#8217;s disposal. Even with 250 MB of memory available to PHP, loading a 5 MB document can take a very long time, if it loads at all. You can&#8217;t have that kind of bottleneck in production applications.<\/p>\n<p>To solve this problem, we can break the original xml document into several smaller xml documents and then load each of those files for processing. 
This approach is not only much faster than trying to process the original document but also keeps server load to a minimum.<\/p>\n<p><strong>SPLITTING LOGIC<\/strong><\/p>\n<p>The whole concept of splitting an xml file is based on the premise that you have a repeating tag which will form the boundary tag on which the split will take place. The original xml file will be split into multiple files, which will be written to the current working directory on the server, and an array of the generated filenames will be returned. <strong>This logic also assumes that each boundary tag starts on its own line, ending with a newline character, which is typical of pretty-printed xml files but is not required by xml itself.<\/strong><\/p>\n<p>The following terms are used:<\/p>\n<p><em><strong>Boundary Tag &#8211; <\/strong>The xml tag at which the splitting will occur.<br \/>\n<\/em><\/p>\n<p><strong><em>Start At &#8211; <\/em><\/strong><em>Each generated file is given a number, e.g. 1, 2, 3, 4. This value determines where the numbering starts. The function numbers the first file one above this value, so zero gives 1.xml, 2.xml and so on.<br \/>\n<\/em><\/p>\n<p><strong><em>Max Items &#8211; <\/em><\/strong><em>How many occurrences of the boundary tag go into each file before a split is made. 
E.g. 100 or 500<\/em><\/p>\n<p><strong><em>Fixed Footer<\/em><\/strong> &#8211; <em>This is explained in detail further below.<\/em><\/p>\n<p>An example of a split is given below:<\/p>\n<p><em><a href=\"https:\/\/truelogic.org\/wordpress\/2012\/07\/03\/split-large-xml-into-parts-before-processing-in-php\/show\/\" rel=\"attachment wp-att-1406\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-1406\" title=\"show\" src=\"https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2012\/07\/show.jpeg\" alt=\"\" width=\"1024\" height=\"377\" srcset=\"https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2012\/07\/show.jpeg 1024w, https:\/\/truelogic.org\/wordpress\/wp-content\/uploads\/2012\/07\/show-300x110.jpg 300w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><\/em>The data appearing before the first occurrence of the boundary tag is the header part. The data appearing after the last occurrence of the boundary tag is the footer part.<\/p>\n<p>Every split file that is created must therefore contain the header part, then the extracted boundary tag data, and then the footer at the end, so that it is a valid xml file with the same structure as the original xml file.<\/p>\n<p>So for the above example with 500 product tags, with max items set at 100, 
it will create 5 files of 100 products each.<\/p>\n<p>The entire logic has been made into a PHP function which can be used anywhere.<br \/>\n<pre class=\"brush: php\">\/**\r\n\t * Function to break an xml file into several smaller files \r\n\t * If the orig xml file holds no more than $maxItems items then it will be converted into a single file\r\n\t * @param string $boundaryTag for product boundary tag name\r\n\t * @param int $startAt file number to start at (the first file is numbered $startAt + 1)\r\n\t * @param int $maxItems how many occurrences of the item to break the file at\r\n\t * @param string $rawdata the raw data from the original xml file\r\n\t * @param string $fixedFooter if not null or empty then footer will be this string and not computed\r\n\t * @return array filenames created, e.g. 1.xml, 2.xml, in the current working directory\r\n\t **\/\r\n\tfunction breakIntoFiles($boundaryTag, $startAt, $maxItems, $rawdata, $fixedFooter) {\r\n\t\t \r\n\t\t\t$arr = explode(&quot;\\n&quot;,$rawdata);\r\n\t\t\t$items = 0; \/\/ no.of items in the current chunk. resets everytime a file is created\r\n\t\t\t$files = $startAt; \/\/ number of the last file created\r\n\t\t\t$length= count($arr); \r\n\t\t\t$header = &quot;&quot;; \/\/ header block for xml file\r\n\t\t\t$footer = &quot;&quot;; \/\/ footer block for xml file\r\n\t\t\t$chunk = &quot;&quot;;  \/\/ chunk of xml data to be written into file\r\n\t\t\t$arrFiles = array(); \/\/ array of files created\r\n\t\t\t$boundaryIsFound = false; \/\/ true when first boundary tag is found\r\n\r\n\t\t\t\t\t\/\/ get footer data: every line after the last closing boundary tag\r\n\t\t\t$footerBreak= &quot;&lt;\/&quot; . trim($boundaryTag). &quot;&gt;&quot;;\t\t\r\n\r\n\t\t\tfor ($i = $length-1; $i&gt;= 0; $i--){\r\n\t\t\t\t$line = rtrim($arr[$i], &quot;\\r&quot;);\r\n\t\t\t\t\/\/ strict comparison, because strpos() returns 0 when the tag starts the line\r\n\t\t\t\tif (strpos($line, $footerBreak) === false) {\r\n\t\t\t\t\t$footer = $line . &quot;\\r\\n&quot; . 
$footer;\r\n\t\t\t\t}\r\n\t\t\t\telse\r\n\t\t\t\t\tbreak;\r\n\t\t\t}\r\n\t\t\t$footerStart = $i + 1; \/\/ index of the first footer line, 0 if no closing tag was found\r\n\r\n\t\t\t\t\t\/\/ process main data, stopping before the footer lines\t\t\r\n\t\t\tfor ($i = 0;$i &lt; $footerStart; $i++){\r\n\t\t\t\t$line  = rtrim($arr[$i], &quot;\\r&quot;);\r\n\t\t\t\t\t\t\t\r\n\t\t\t\tif (strpos($line, &quot;&lt;&quot;. trim($boundaryTag) . &quot;&gt;&quot;) !== false ||\r\n\t\t\t\t\tstrpos($line, &quot;&lt;&quot; . trim($boundaryTag) .&quot; &quot;) !== false) {\r\n\t\t\t\t\t$items ++;\r\n\t\t\t\t\t$boundaryIsFound = true;\r\n\t\t\t\t}\r\n\r\n\r\n\t\t\t\tif (!$boundaryIsFound)\r\n\t\t\t\t\t$header .= $line . &quot;\\r\\n&quot;;\r\n\t\r\n\t\t\t\t\/\/ this line opens one item too many, so write out the chunk collected so far\r\n\t\t\t\tif ($items &gt; $maxItems) {\r\n\t\t\t\t\t$items = 1; \/\/ the current line is the first item of the next chunk\r\n\t\t\t\t\t$files++;\r\n\r\n\t\t\t\t\t$filename =  $files . &quot;.xml&quot;;\r\n\t\t\t\t\t$f = fopen($filename, &quot;w&quot;);\r\n\t\t\t\t\tfwrite($f,$header);\r\n\t\t\t\t\tfwrite($f, $chunk);\r\n\t\t\t\t\tif ($fixedFooter == null || $fixedFooter == &#039;&#039;)\r\n\t\t\t\t\t\tfwrite($f, $footer);\r\n\t\t\t\t\telse\r\n\t\t\t\t\t\tfwrite($f, $fixedFooter);\t\r\n\t\t\t\t\tfclose($f);\r\n\t\t\t\t\t$arrFiles[] = $filename;\r\n\t\t\t\t\t$chunk = $line . &quot;\\r\\n&quot;;\r\n\t\t\t\t}\r\n\t\t\t\telse {\r\n\t\t\t\t\tif ($boundaryIsFound)\r\n\t\t\t\t\t\t$chunk .= $line . &quot;\\r\\n&quot;;\r\n\t\t\t\t}\r\n\t\t\t}\r\n\r\n\t\t\t\/\/ whatever is left in the chunk becomes the last file\r\n\t\t\t$files++;\r\n\r\n\t\t\t$filename =  $files . &quot;.xml&quot;;\r\n\t\t\t$f = fopen($filename, &quot;w&quot;);\r\n\t\t\tfwrite($f,$header);\r\n\t\t\tfwrite($f, $chunk);\r\n\t\t\tif ($fixedFooter == null || $fixedFooter == &#039;&#039;)\r\n\t\t\t\tfwrite($f, $footer);\r\n\t\t\telse\r\n\t\t\t\tfwrite($f, $fixedFooter);\r\n\t\t\tfclose($f);\r\n\t\t\t$arrFiles[] = $filename;\r\n\r\n\t\t\treturn $arrFiles;\r\n\r\n\t}\t\t\t\t<\/pre><\/p>\n<p><strong>EXPLANATION OF THE FIXED FOOTER PARAMETER<\/strong><\/p>\n<p>By default, whatever data appears after the last occurrence of the boundary tag is added as the footer of each generated xml file. Sometimes you may not want this behavior. 
You may want to add in your custom data to the footer. This data can be passed in as a string in the fixedFooter argument. In such a case it is your responsibility to ensure that the data is such that the resulting xml file is a valid xml file with all open tags closed in the end.<\/p>\n","protected":false},"excerpt":{"rendered":"<div class=\"mh-excerpt\"><p>PHP has very powerful xml functions and libraries in the form of DOMDocument, Xpath, SimpleXML and a few others. But all of them have a <a class=\"mh-excerpt-more\" href=\"https:\/\/truelogic.org\/wordpress\/2012\/07\/03\/split-large-xml-into-parts-before-processing-in-php\/\" title=\"Split large xml into parts before processing in PHP\">[&#8230;]<\/a><\/p>\n<\/div>","protected":false},"author":1,"featured_media":1419,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-1403","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-apachephp"],"_links":{"self":[{"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/posts\/1403","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/comments?post=1403"}],"version-history":[{"count":19,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/posts\/1403\/revisions"}],"predecessor-version":[{"id":2702,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/posts\/1403\/revisions\/2702"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/media\/1419"}],"wp:attachment":[{"href":"https:\/\/truelogic.org\/wordpress\/wp-js
on\/wp\/v2\/media?parent=1403"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/categories?post=1403"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/truelogic.org\/wordpress\/wp-json\/wp\/v2\/tags?post=1403"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}