
I'm using SimpleXML to load in some xml files (which I didn't write/provide and can't really change the format of).

Occasionally (e.g. one or two files out of every 50 or so) they don't escape special characters (mostly &, but sometimes other invalid things too). This creates an issue because SimpleXML in PHP simply fails, and I don't know of any good way to handle parsing invalid XML.

My first idea was to preprocess the XML as a string and put ALL fields in as CDATA so it would work, but for some ungodly reason the XML I need to process puts all of its data in the attribute fields. Thus I can't use the CDATA idea. An example of the XML being:

 <Author v="By Someone & Someone" />

What's the best way to escape or replace all the invalid characters in the XML before I load it with SimpleXML?

 Answers

5

What you need is something that will use libxml's internal errors to locate invalid characters and escape them accordingly. Here's a mockup of how I'd write it. Take a look at the result of libxml_get_errors() for error info.

function load_invalid_xml($xml)
{
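    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $sxe = simplexml_load_string($xml);

    // Compare against false explicitly: a valid but empty element is falsy
    if ($sxe !== false)
    {
        // Restore the previous error-handling mode before returning
        libxml_use_internal_errors($use_internal_errors);
        return $sxe;
    }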

    $fixed_xml = '';
    $last_pos  = 0;

    foreach (libxml_get_errors() as $error)
    {
        // $pos is the byte offset of the faulty character in $xml;
        // compute_position() is a placeholder that you have to write yourself
        $pos = compute_position($xml, $error->line, $error->column);
        $fixed_xml .= substr($xml, $last_pos, $pos - $last_pos) . htmlspecialchars($xml[$pos]);
        $last_pos = $pos + 1;
    }
    $fixed_xml .= substr($xml, $last_pos);

    libxml_use_internal_errors($use_internal_errors);

    return simplexml_load_string($fixed_xml);
}
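
The mockup leaves the position helper to the reader. Here is a minimal sketch of one way to write it, under the hypothetical name line_column_to_offset(). It assumes that line and column are 1-based, that columns are counted in bytes, and that lines end in "\n". It takes the XML string explicitly, because an offset cannot be derived from line and column alone. Whether the column libxml reports points exactly at the offending character can vary by error type, so treat the result as a starting point to verify against your own files.

```php
<?php
// Hypothetical helper (not part of libxml or SimpleXML): convert a 1-based
// line/column pair into a 0-based byte offset within $xml.
function line_column_to_offset(string $xml, int $line, int $column): int
{
    $offset = 0;

    // Skip past ($line - 1) newline characters to reach the start of the line
    for ($i = 1; $i < $line; $i++) {
        $newline = strpos($xml, "\n", $offset);
        if ($newline === false) {
            break; // fewer lines than expected: stay on the last line found
        }
        $offset = $newline + 1;
    }

    // Columns are assumed to be 1-based, so column 1 is the line start
    $pos = $offset + max(0, $column - 1);

    // Clamp so the result is always a valid index into a non-empty string
    return min($pos, max(0, strlen($xml) - 1));
}

// Usage: locate the bare ampersand in the sample from the question
$sample = '<Author v="By Someone & Someone" />';
$amp    = strpos($sample, '&');
$found  = line_column_to_offset($sample, 1, $amp + 1);
```

In the usage lines, $found equals $amp because the sample is a single line and the column passed in is simply the ampersand's offset plus one.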
Wednesday, November 30, 2022
3

So basically what you need to do is a function that takes each <asset/> child of current node, builds the HTML then checks if the current node has <asset/> children of its own and keeps recursing deeper down the tree.

Here's how you can do it:

function printAssetMap()
{
    return printAssets(simplexml_load_file(X_ASSETS));
}

function printAssets(SimpleXMLElement $parent)
{
    $html = "<ul>\n";
    foreach ($parent->asset as $asset)
    {
        $html .= printAsset($asset);
    }
    $html .= "</ul>\n";

    return $html;
}

function printAsset(SimpleXMLElement $asset)
{
    $html = '<li id="asset'.$asset->asset_assetid.'"><ins>&nbsp;</ins><a href="#">'.$asset->asset_name.' ['.$asset->asset_assetid.']</a>';

    if (isset($asset->asset))
    {
        // has <asset/> children
        $html .= printAssets($asset);
    }

    $html .= "</li>\n";

    return $html;
}

By the way, I would expect a function named "printX" to actually print or echo something rather than return it. Perhaps you should name those functions "buildX"?

Saturday, September 17, 2022
 
shdr
 
4

Using recursion, you can create a brand new document based on the input, solving all your points at once:

Code

<?php

$input = file_get_contents('http://www.fluffyduck.com.au/sampleXML.xml');
$inputDoc = new DOMDocument();
$inputDoc->loadXML($input);

$outputDoc = new DOMDocument("1.0", "utf-8");
$outputDoc->appendChild($outputDoc->createElement("root"));

function ConvertUserToItem($outputDoc, $inputNode, $outputNode)
{
    if ($inputNode->hasChildNodes())
    {
        foreach ($inputNode->childNodes as $inputChild)
        {
            if (strtolower($inputChild->nodeName) == "user")
            {
                $outputChild = $outputDoc->createElement("item");
                $outputNode->appendChild($outputChild);
                // read input attributes and convert them to nodes
                if ($inputChild->hasAttributes())
                {
                    $outputContent = $outputDoc->createElement("content");
                    foreach ($inputChild->attributes as $attribute)
                    {
                        if (strtolower($attribute->name) != "id")
                        {
                            $outputContent->appendChild($outputDoc->createElement($attribute->name, $attribute->value));
                        }
                        else
                        {
                            $outputChild->setAttribute($attribute->name, $attribute->value);
                        }
                    }               
                    $outputChild->appendChild($outputContent);
                }
                // recursive call
                ConvertUserToItem($outputDoc, $inputChild, $outputChild);
            }
        }
    }
}

ConvertUserToItem($outputDoc, $inputDoc->documentElement, $outputDoc->documentElement);

header("Content-Type: text/xml; charset=" . $outputDoc->encoding);
echo $outputDoc->saveXML();
?>

Output

<?xml version="1.0" encoding="utf-8"?>
<root>
    <item id="41">
        <content>
            <username>bsmain</username>
            <firstname>Boss</firstname>
            <lastname>MyTest</lastname>
            <fullname>Test Name</fullname>
            <email>lalal@test.com</email>
            <logins>1964</logins>
            <lastseen>11/09/2012</lastseen>
        </content>
        <item id="61">
            <content>
                <username>underling</username>
                <firstname>Under</firstname>
                <lastname>MyTest</lastname>
                <fullname>Test Name</fullname>
                <email>lalal@test.com</email>
                <logins>4</logins>
                <lastseen>08/09/2009</lastseen>
            </content>
        </item>
...
Friday, December 9, 2022
 
petele
 
4

I got rid of the parse error by adding a header:

header('Content-type: text/xml');
$xml = new XMLWriter();
$xml->openURI('php://output');
$xml->setIndent(true);
// Version and encoding belong to startDocument(), not the constructor
$xml->startDocument('1.0', 'UTF-8');

$xml->startElement("XML"); 
while ( $stmt->fetch() ) {
    $xml->startElement("ITEM");
    $xml->writeElement('ELEMENT', $reviewdate);
    $xml->endElement(); // </ITEM>
}
$stmt->close();
$xml->endElement(); // xml

$xml->flush();

unset($xml); 
Friday, August 12, 2022
 
rem.co
 
3

There is another way, which needs two steps but doesn't require you to treat the XML as a string anywhere in the process:

declare @result XML =
(
    SELECT 
        'Test' AS Test,
        'SomeMore' AS SomeMore
    FOR XML PATH('TestPath')
)
set @result.modify('
    insert <?xml-stylesheet type="text/xsl" href="stylesheet.xsl"?>
    before /*[1]
')

Sqlfiddle Demo

The XQuery expression passed to modify() function tells SQL Server to insert the processing instruction node before the root element of the XML.

UPDATE:

I found another alternative based on the following thread: "Merge the two xml fragments into one?". I personally prefer this way:

SELECT CONVERT(XML, '<?xml-stylesheet type="text/xsl" href="stylesheet.xsl"?>'),
(
    SELECT 
        'Test' AS Test,
        'SomeMore' AS SomeMore
    FOR XML PATH('TestPath')
)
FOR XML PATH('')

Sqlfiddle Demo

Saturday, October 15, 2022
 