Viewed   855 times

This question is intended as a reference to answer a particularly common question, which might take different forms:

  • I have an XML document which contains multiple namespaces; how do I parse it with SimpleXML?
  • My XML has a colon (":") in the tag name, how do I access it with SimpleXML?
  • How do I access attributes in my XML file when they have a colon in their name?

If your question has been closed as a duplicate of this, it may not be identical to these examples, but this page should tell you what you need to know.

Here is an illustrative example:

$xml = '
    <?xml version="1.0" encoding="utf-8"?>
    <document xmlns="http://example.com" xmlns:ns2="https://namespaces.example.org/two" xmlns:seq="urn:example:sequences">
        <list type="short">
            <ns2:item seq:position="1">A thing</ns2:item>
            <ns2:item seq:position="2">Another thing</ns2:item>
        </list>
    </document>
';
$sx = simplexml_load_string($xml);

This code will not work; why not?

foreach ( $sx->list->ns2:item as $item ) {
    echo 'Position: ' . $item['seq:position'] . "n";
    echo 'Item: ' . (string)$item . "n";
}

The first problem is that ->ns2:item is invalid syntax; but changing it to this doesn't work either:

foreach ( $sx->list->{'ns2:item'} as $item ) { ... }

Why not, and what should you use instead?

 Answers

2

What are XML namespaces?

A colon (:) in a tag or attribute name means that the element or attribute is in an XML namespace. Namespaces are a way of combining different XML formats / standards in one document, and keeping track of which names come from which format. The colon, and the part before it, aren't really part of the tag / attribute name, they just indicate which namespace it's in.

An XML namespace has a namespace identifier, which is identified by a URI (a URL or URN). The URI doesn't point at anything, it's just a way for someone to "own" the namespace. For instance, the SOAP standard uses the namespace http://www.w3.org/2003/05/soap-envelope and an OpenDocument file uses (among others) urn:oasis:names:tc:opendocument:xmlns:meta:1.0. The example in the question uses the namespaces http://example.com and https://namespaces.example.org/two.

Within a document, or a section of a document, a namespace is given a local prefix, which is the part you see before the colon. For instance, in different documents, the SOAP namespace might be given the local prefix soap:, SOAP:, SOAP-ENV:, env:, or just ns1:. These names are linked back to the identifier of the namespace using a special xmlns attribute, e.g. xmlns:soap="http://www.w3.org/2003/05/soap-envelope". The choice of prefix in a particular document is completely arbitrary, and could change each time it was generated without changing the meaning.

Finally, there is a default namespace in each document, or section of a document, which is the namespace used for elements with no prefix. It is defined by an xmlns attribute with no :, e.g. xmlns="http://www.w3.org/2003/05/soap-envelope". In the example above, <list> is in the default namespace, which is defined as http://example.com.

Somewhat peculiarly, un-prefixed attributes are never in the default namespace, but in a kind of "void namespace", which the standard doesn't clearly define. See: XML Namespaces and Unprefixed Attributes

SimpleXML gives me an empty object; what's wrong?

If you use print_r, var_dump, or similar "dump structure" functions on a SimpleXML object with namespaces in, some of the contents will not display. It is still there, and can be accessed as described below.

How do you access namespaces in SimpleXML?

SimpleXML provides two main methods for using namespaces:

  • The ->children() method allows you to access child elements in a particular namespace. It effectively switches your object to look at that namespace, until you call it again to switch back, or to another namespace.
  • The ->attributes() method works in a similar way, but allows you to access attributes in a particular namespace.

Both of these methods take the namespace identifier as their first argument. Since these identifiers are rather long, it can be useful to define a constant or variable to represent the namespaces you're working with, so you don't have to copy and paste the full URI everywhere.

For instance, the example above might become:

define('XMLNS_EG2', 'https://namespaces.example.org/two');
define('XMLNS_SEQ', 'urn:example:sequences');
foreach ( $sx->list->children(XMLNS_EG2)->item as $item ) {
    echo 'Position: ' . $item->attributes(XMLNS_SEQ)->position . "n";
    echo 'Item: ' . (string)$item . "n";
}

As a short-hand, you can also pass the methods the local alias of the namespace, by giving the second parameter as true. Remember that this prefix could change at any time, for instance, a generator might assign prefixes ns1, ns2, etc, and assign them in a different order if the code changes slightly. Using this short-hand, the code would become:

foreach ( $sx->list->children('ns2', true)->item as $item ) {
    echo 'Position: ' . $item->attributes('seq', true)->position . "n";
    echo 'Item: ' . (string)$item . "n";
}

(This short-hand was added in PHP 5.2, and you may see really old examples using a more long-winded version using $sx->getNamespaces to get a list of prefix-identifier pairs. This is the worst of both worlds, as you're still hard-coding the prefix rather than the identifier.)

Wednesday, August 31, 2022
1
Great answer Helps Alot.
Monday, September 5, 2022
 
fabito
 
5

Try this

    $xml = new SimpleXmlElement($feed);
    foreach ($xml->entry as $entry)
    {

        $namespaces = $entry->getNameSpaces(true);
        $ai = $entry->children($namespaces['ai']);

        foreach ($ai->ocurrence as $o)
        {
            $date=$o->attributes();
            echo $date['date'];
            echo "<br/>";
        }
    }
Friday, September 30, 2022
2

(Read @ThW's answer about why an array is actually not that important to aim for)

I know it's easy with non-namespaced nodes, but I don't know where to begin on something like this.

It's as easy as with namespaced nodes because technically those are the same. Let's give a quick example, the following script loops over all elements in the document regardless of namespace:

$result = $xml->xpath('//*');
foreach ($result as $element) {
    $depth = count($element->xpath('./ancestor::*'));
    $indent = str_repeat('  ', $depth);
    printf("%s %sn", $indent, $element->getName());
}

The output in your case is:

 message
   header
     response
       result
       gsbStatus
   body
     bodyContent
       attendee
         person
           name
           firstName
           lastName
           address
             addressType
           phones
           email
           type
         contactID
         joinStatus
         meetingKey

As you can see you can iterate over all elements as if they would not have any namespace at all.

But as it has been outlined, when you ignore the namespace you'll also loose important information. For example with the document you have you're actually interested in the attendee and common elements, the service elements deal with the transport:

$uriAtt = 'http://www.webex.com/schemas/2002/06/service/attendee';
$xml->registerXPathNamespace('att', $uriAtt);

$uriCom = 'http://www.webex.com/schemas/2002/06/common';
$xml->registerXPathNamespace('com', $uriCom);

$result = $xml->xpath('//att:*|//com:*');
foreach ($result as $element) {
    $depth  = count($element->xpath("./ancestor::*[namespace-uri(.) = '$uriAtt' or namespace-uri(.) = '$uriCom']"));
    $indent = str_repeat('  ', $depth);
    printf("%s %sn", $indent, $element->getName());
}

The exemplary output this time:

 attendee
   person
     name
     firstName
     lastName
     address
       addressType
     phones
     email
     type
   contactID
   joinStatus
   meetingKey

So why drop all the namespaces? They help you to obtain the elements you're interested in. You can also do it dynamically

Friday, October 14, 2022
 
eddyjs
 
5

Street cred: I worked for Infopop (later known as Groupee, now Social Strata), the creators of UBBCode, the thing that was copied and transformed into just plain old regular "BBCode."

tl;dr: Time to write your own non-regex parser.


Most BBCode parsers use regexes, and that works for most cases, but you're doing something custom here. Plain old regular expressions are not going to help you. Regexes have two modes of operation that get in our way: we can either match everything between two tags in "greedy" mode, or in "not greedy" mode.

In "greedy" mode, we'll capture everything between the very first opening task and the very last closing tag. This breaks things horribly. Take this case:

[a][c]...[/c][/a]...[a]...[/a]

A greedy regex like [a].+[/a] is going to grab everything from that first opening tag to that last closing tag, ignoring the fact that the closer isn't closing the opener.

The other option is worse. Take this case:

[a][a]...[/a][/a]

An ungreedy regex like [a].+?[/a] (the only change is the question mark) is going to match the first opening tag, but then it'll match the first closing tag, again ignoring that the closing tag doesn't belong to the opening tag.

The way I solved this way, way back in the primitive days was to completely ignore the fact that the opening and closing tags didn't match. I simply looped the entire chain of tag transformation regexes until the output stopped changing. It was simple and effective, mainly because the available tag set was intentionally limited, so nesting was never an issue.

The instant you allow nesting of identical tags, blind, brute force is no longer a suitable tool.

If none of the BBCode parsing engines out there are going to work for you, you might have to write your own. Check all of them out. There are some on PEAR, there's a PECL extension, etc. Also check other languages for inspiration, Perl's CPAN has a dozen different implementations, some of which are very powerful and complex (if there isn't a proper recursive descent parser in that mix, I'll be depressed). This is a good challenge, but it's not too hard. Then again, I've written like five now (none of which I can release), so maybe I'm biased?

Start by exploding the string on [ and ]. Go through the resulting array, keeping track of when the array index following the opening bracket and before the next closing bracket happens to look like a valid tag and/or attributes. You're going to need to think about what happens when an attribute can contain a bracket, or worse, are URLs that are bracket-heavy (like PHP array syntax). You'll also need to think about attributes in general, including how (if?) they are quoted, if multiple attributes per tag are allowed (as in your example), and what to do with invalid attributes.

As you continue to process the string, you will also need to keep track of what tags are open, and in what order. You'll have to think about what tags are permitted inside other tags. You'll also have to deal with mis-nesting, like [a][/a]. Your options will be either re-opening the inner tag after the outer closes, or closing the inner as soon as the outer does. Worse, different behavior might make sense depending on the situation. Worse-worse are wacky tags like

  • inside
    • , which traditionally doesn't have a closing tag!

      Once you've processed the string and have created a list of open and closing tags (and possibly re-balanced the opens and closes), then you can transform the result into HTML, or whatever your output ends up being. This is when and how you'd move the output of those specific tags to the end of the new document.

      Once you've finished up, write a thousand test cases. Try to break it, blow it into itty bitty chunks, produce XSS vulnerabilities, and otherwise do your best to make your life hell. It will be worth it, because the result will be a BBCode engine that will do what you're trying to do.

    Wednesday, October 12, 2022
     
    4

    You've got a few problems going on. First you need to specify the namespace in the XPath pattern, the XML isn't well formed (closing tag is not an end tag) and Select-Xml returns XmlInfo and not XmlElement directly. Try this:

    $xml = [xml]@'
    <submission version="2.0" type="TREE" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:noNamespaceSchemaLocation="TREE.xsd" xmlns="some/kind/of/tree/v1">
      <group>
        <item></item>
        <item></item>
        <item></item>
      </group>
    </submission>
    '@
    
    $ns = @{dns="some/kind/of/tree/v1"}
    $items = Select-Xml -Xml $xml -XPath '//dns:item' -Namespace $ns
    $items | Foreach {$_.Node.Name}
    
    Thursday, September 29, 2022
    Only authorized users can answer the search term. Please sign in first, or register a free account.
    Not the answer you're looking for? Browse other questions tagged :