Asked  2 Years ago    Answers:  5   Viewed   66 times

I have a simple XML document:

<?xml version="1.0"?>
<cellphones>
  <telefon>
    <model>Easy DB</model>
    <proizvodjac>Alcatel</proizvodjac>
    <cena>25</cena>
  </telefon>
  <telefon>
    <model>3310</model>
    <proizvodjac>Nokia</proizvodjac>
    <cena>30</cena>
  </telefon>
  <telefon>
    <model>GF768</model>
    <proizvodjac>Ericsson</proizvodjac>
    <cena>15</cena>
  </telefon>
  <telefon>
    <model>Skeleton</model>
    <proizvodjac>Panasonic</proizvodjac>
    <cena>45</cena>
  </telefon>
  <telefon>
    <model>Earl</model>
    <proizvodjac>Sharp</proizvodjac>
    <cena>60</cena>
  </telefon>
</cellphones>

I need to print the content of this file using XML DOM, and it needs to be structured like this:

"model: Easy DB
proizvodjac: Alcatel
cena: 25"

for each node inside the XML.

IT HAS TO BE DONE using XML DOM. That's the problem. I can do it the usual, simple way. But this one bothers me because I can't seem to find any solution on the internet.

This is as far as I can get, but I need to access the inner nodes (child nodes) and get their values. I also want to get rid of a weird "#text" string that comes up out of the blue.

<?php
    // create the DOMDocument object
    $xmlDoc = new DOMDocument();

    // load the XML file into the document object
    $xmlDoc->load("poruke.xml");

    // assign the root element to a variable
    $x = $xmlDoc->documentElement;

    // loop over the child nodes and print information about each one
    foreach ($x->childNodes AS $item){
        print $item->nodeName . " = " . $item->nodeValue . "<br />";
    }
?>

Thanks

 Answers

4

Explanation for weird #text strings

The weird #text strings don't come out of the blue; they are actual text nodes. When you load a formatted XML document with DOM, any whitespace (indentation, line breaks) as well as the node values will be part of the DOM as DOMText instances by default, e.g.

<cellphones>\n\t<telefon>\n\t\t<model>Easy DB…
E           T   E        T     E      T      

where E is a DOMElement and T is a DOMText.

To get around that, load the document like this:

$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->load('file.xml');

Then your document will be structured as follows

<cellphones><telefon><model>Easy DB…
E           E        E      T

Note that individual nodes representing the value of a DOMElement will still be DOMText instances, but the nodes that control the formatting are gone. More on that later.

Proof

You can test this easily with this code:

$dom = new DOMDocument;
$dom->preserveWhiteSpace = TRUE; // change to FALSE to see the difference
$dom->load('file.xml');
foreach ($dom->getElementsByTagName('telefon') as $telefon) {
    foreach($telefon->childNodes as $node) {
        printf(
            "Name: %s - Type: %s - Value: %s\n",
            $node->nodeName,
            $node->nodeType,
            urlencode($node->nodeValue)
        );
    }
}

This code runs through all the telefon elements in your given XML and prints out the node name, type and urlencoded node value of their child nodes. When you preserve the whitespace, you will get something like

Name: #text - Type: 3 - Value: %0A++++
Name: model - Type: 1 - Value: Easy+DB
Name: #text - Type: 3 - Value: %0A++++
Name: proizvodjac - Type: 1 - Value: Alcatel
Name: #text - Type: 3 - Value: %0A++++
Name: cena - Type: 1 - Value: 25
Name: #text - Type: 3 - Value: %0A++
…

The reason I urlencoded the value is to show that there are in fact DOMText nodes containing the indentation and the line breaks in your DOMDocument. %0A is a line break, while each + is a space.

When you compare this with your XML, you will see there is a line break after each opening <telefon> tag, followed by four spaces until the <model> element starts. Likewise, there is only a newline and two spaces between the closing </cena> tag and the closing </telefon> tag.

The given type for these nodes is 3, which, according to the list of predefined constants, is XML_TEXT_NODE, i.e. a DOMText node. Lacking a proper element name, these nodes have the name #text.

Disregarding Whitespace

Now, when you disable preservation of whitespace, the above will output:

Name: model - Type: 1 - Value: Easy+DB
Name: proizvodjac - Type: 1 - Value: Alcatel
Name: cena - Type: 1 - Value: 25
Name: model - Type: 1 - Value: 3310
…

As you can see, there are no more #text nodes, only type 1 nodes, which means XML_ELEMENT_NODE, i.e. DOMElement.
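
If changing preserveWhiteSpace is not an option, the same filtering can be done while iterating, by checking the node type. The following is a minimal sketch, assuming a shortened inline copy of the XML from the question; the helper name elementChildren is made up for illustration and is not part of the DOM extension.

```php
<?php
// Hypothetical helper: returns only the element children of a node,
// skipping whitespace-only DOMText nodes and any other non-element nodes.
function elementChildren(DOMNode $parent): array
{
    $elements = [];
    foreach ($parent->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $elements[] = $child;
        }
    }
    return $elements;
}

// Assumption: a shortened, indented inline copy of the question's XML.
$xml = <<<XML
<cellphones>
  <telefon>
    <model>Easy DB</model>
    <proizvodjac>Alcatel</proizvodjac>
    <cena>25</cena>
  </telefon>
</cellphones>
XML;

$dom = new DOMDocument;
$dom->loadXML($xml); // preserveWhiteSpace keeps its default value (TRUE)

$telefon = $dom->getElementsByTagName('telefon')->item(0);
foreach (elementChildren($telefon) as $element) {
    printf("%s = %s\n", $element->nodeName, $element->nodeValue);
}
```

Here the whitespace text nodes are still in the document (the telefon element has seven child nodes), but only the three elements are printed.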

DOMElements contain DOMText nodes

In the beginning I said, the values of DOMElements are DOMText instances too. But in the output above, they are nowhere to be seen. That's because we are accessing the nodeValue property, which returns the value of the DOMText as string. We can prove that the value is a DOMText easily though:

$dom = new DOMDocument;
$dom->preserveWhiteSpace = FALSE;
$dom->loadXML($xml);
foreach ($dom->getElementsByTagName('telefon') as $telefon) {
    $node = $telefon->firstChild->firstChild; // 1st child of model
    printf(
        "Name: %s - Type: %s - Value: %s\n",
        $node->nodeName,
        $node->nodeType,
        urlencode($node->nodeValue)
    );
}

will output

Name: #text - Type: 3 - Value: Easy+DB
Name: #text - Type: 3 - Value: 3310
Name: #text - Type: 3 - Value: GF768
Name: #text - Type: 3 - Value: Skeleton
Name: #text - Type: 3 - Value: Earl

And this proves that a DOMElement contains its value as a DOMText and that nodeValue is just returning the content of the DOMText directly.

More on nodeValue

In fact, nodeValue is smart enough to concatenate the contents of all DOMText nodes below an element, including those nested in child elements:

$dom = new DOMDocument;
$dom->loadXML('<root><p>Hello <em>World</em>!!!</p></root>');
$node = $dom->documentElement->firstChild; // p
printf(
    "Name: %s - Type: %s - Value: %s\n",
    $node->nodeName,
    $node->nodeType,
    $node->nodeValue
);

will output

Name: p - Type: 1 - Value: Hello World!!!

although these are really the combined values of

DOMText "Hello " (including the trailing space)
DOMElement em with DOMText "World"
DOMText "!!!"
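
To see these three children individually, you can iterate over the childNodes of the p element instead of reading its nodeValue. A small sketch using the same sample string; the $lines variable is only there for illustration.

```php
<?php
$dom = new DOMDocument;
$dom->loadXML('<root><p>Hello <em>World</em>!!!</p></root>');
$p = $dom->documentElement->firstChild; // the <p> element

// Collect one description line per direct child of <p>.
$lines = [];
foreach ($p->childNodes as $child) {
    $lines[] = sprintf(
        "%s (type %d): '%s'",
        $child->nodeName,
        $child->nodeType,
        $child->nodeValue
    );
}
echo implode("\n", $lines), "\n";
```

This lists a #text node holding 'Hello ', the em element with the value 'World', and a second #text node holding '!!!'.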

Printing the content of an XML file using XML DOM

To finally answer your question, look at the first test code. Everything you need is in there. And of course by now you have been given fine other answers too.
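
Put together, a sketch of the requested output could look like this. It assumes the XML is given inline and shortened to two phones (the question loads it from a file with load() instead), and the function name describePhones is hypothetical.

```php
<?php
// Assumption: inline, shortened copy of the XML from the question.
$xml = <<<XML
<cellphones>
  <telefon>
    <model>Easy DB</model>
    <proizvodjac>Alcatel</proizvodjac>
    <cena>25</cena>
  </telefon>
  <telefon>
    <model>3310</model>
    <proizvodjac>Nokia</proizvodjac>
    <cena>30</cena>
  </telefon>
</cellphones>
XML;

// Hypothetical helper: builds a "name: value" listing for every <telefon>.
function describePhones(string $xml): string
{
    $dom = new DOMDocument;
    $dom->preserveWhiteSpace = false; // drop whitespace-only text nodes
    $dom->loadXML($xml);

    $blocks = [];
    foreach ($dom->getElementsByTagName('telefon') as $telefon) {
        $lines = [];
        foreach ($telefon->childNodes as $node) {
            $lines[] = $node->nodeName . ': ' . $node->nodeValue;
        }
        $blocks[] = implode("\n", $lines);
    }
    // Separate the phones with a blank line.
    return implode("\n\n", $blocks) . "\n";
}

echo describePhones($xml);
```

For the two phones above this prints "model: Easy DB", "proizvodjac: Alcatel" and "cena: 25" on separate lines, then a blank line, then the same three fields for the Nokia 3310. When outputting to a browser, you would replace the "\n" separators with "<br />" as in the question's code.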

Wednesday, October 26, 2022
2

probably an answer here: (if protocol is the issue)

How to get file_get_contents() to work with HTTPS?

Tuesday, August 30, 2022
5

You need to import any node to append it to another document:

$departmentArray->item($i)->appendChild( $doc->importNode( $employee, true ) );
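
A self-contained sketch of the same idea, with two made-up documents. The element names, attributes and variable names other than $doc and $employee are assumptions for illustration only.

```php
<?php
// Assumption: $source holds the employee, $doc is the document being extended.
$source = new DOMDocument;
$source->loadXML('<staff><employee id="7">Ana</employee></staff>');

$doc = new DOMDocument;
$doc->loadXML('<company><department name="IT"/></company>');

$employee   = $source->getElementsByTagName('employee')->item(0);
$department = $doc->getElementsByTagName('department')->item(0);

// importNode() returns a copy of the node that belongs to $doc;
// the second argument TRUE copies the node's children as well.
$department->appendChild($doc->importNode($employee, true));

echo $doc->saveXML($department), "\n";
```

The original employee node stays in $source; only the imported copy is appended to the department element in $doc.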
Thursday, December 22, 2022
5

I think that what you need is actually an XPath expression. You could configure that expression in some property file or whatever you use to retrieve your setup parameters.

In this way, you'd just change the XPath expression whenever your customer hides away the info you use in yet another place.

Basically, XSLT is overkill here; you just need an XPath expression. A single XPath expression lets you home in on each value you are after.

Update

Since we are now talking about JDK 1.4 I've included below 3 different ways of fetching text in an XML file using XPath. (as simple as possible, no NPE guard fluff I'm afraid ;-)

Starting from the most up to date.

0. First the sample XML config file

<?xml version="1.0" encoding="UTF-8"?>
<config>
    <param id="MaxThread" desc="MaxThread"        type="int">250</param>
    <param id="rTmo"      desc="RespTimeout (ms)" type="int">5000</param>
</config>

1. Using JAXP 1.3 standard part of Java SE 5.0

import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.Document;

public class TestXPath {

    private static final String CFG_FILE = "test.xml" ;
    private static final String XPATH_FOR_PRM_MaxThread = "/config/param[@id='MaxThread']/text()";
    public static void main(String[] args) {

        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        docFactory.setNamespaceAware(true);
        DocumentBuilder builder;
        try {
            builder = docFactory.newDocumentBuilder();
            Document doc = builder.parse(CFG_FILE);
            XPathExpression expr = XPathFactory.newInstance().newXPath().compile(XPATH_FOR_PRM_MaxThread);
            Object result = expr.evaluate(doc, XPathConstants.NUMBER);
            if ( result instanceof Double ) {
                System.out.println( ((Double)result).intValue() );
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

2. Using JAXP 1.2 standard part of Java SE 1.4-2

import javax.xml.parsers.*;
import org.apache.xpath.XPathAPI;
import org.w3c.dom.*;

public class TestXPath {

    private static final String CFG_FILE = "test.xml" ;
    private static final String XPATH_FOR_PRM_MaxThread = "/config/param[@id='MaxThread']/text()";

    public static void main(String[] args) {

        try {
            DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
            docFactory.setNamespaceAware(true);
            DocumentBuilder builder = docFactory.newDocumentBuilder();
            Document doc = builder.parse(CFG_FILE);
            Node param = XPathAPI.selectSingleNode( doc, XPATH_FOR_PRM_MaxThread );
            if ( param instanceof Text ) {
                System.out.println( Integer.decode(((Text)(param)).getNodeValue() ) ); 
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

3. Using JAXP 1.1 standard part of Java SE 1.4 + jdom + jaxen

You need to add these 2 jars (available from www.jdom.org - binaries, jaxen is included).

import java.io.File;
import org.jdom.*;
import org.jdom.input.SAXBuilder;
import org.jdom.xpath.XPath;

public class TestXPath {

    private static final String CFG_FILE = "test.xml" ;
    private static final String XPATH_FOR_PRM_MaxThread = "/config/param[@id='MaxThread']/text()";

    public static void main(String[] args) {
        try {
            SAXBuilder sxb = new SAXBuilder();
            Document doc = sxb.build(new File(CFG_FILE));
            Element root = doc.getRootElement();
            XPath xpath = XPath.newInstance(XPATH_FOR_PRM_MaxThread);
            Text param = (Text) xpath.selectSingleNode(root);
            Integer maxThread = Integer.decode( param.getText() );
            System.out.println( maxThread );
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Sunday, November 6, 2022
 
1

&#13; is the carriage return part of a \r\n style line ending. I think DOMDocument encodes it to preserve it. If you check the XML specification, it says that it will get normalized to \n if not encoded.

So you have different options:

  1. Ignore the escaped entities; they get decoded by the XML parser
  2. Use CDATA sections; the normalization is not done there, so DOMDocument sees no need to escape the "\r"
  3. Make sure that you save your file with \n style line endings
  4. Normalize the line endings to \n before creating the DOM

Here is some sample source to show the different behaviour:

$text = "test1;\r\ntest2;\r\ntest3;\r\n";

$dom = new DOMDocument('1.0', 'UTF-8');
$root = $dom->appendChild($root = $dom->createElement('root'));

$root->appendChild(
  $node = $dom->createElement('code')
);
// text node - CR will get escaped
$node->appendChild($dom->createTextNode($text));

$root->appendChild(
  $node = $dom->createElement('code')
);
// cdata - CR will not get escaped
$node->appendChild($dom->createCdataSection($text));

$root->appendChild(
  $node = $dom->createElement('code')
);
// text node, CRLF and CR normalized to LF
$node->appendChild(
  $dom->createTextNode(
    str_replace(array("\r\n", "\r"), "\n", $text)
  )
);

$dom->formatOutput = TRUE;
echo $dom->saveXml();

Output:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <code>test1;&#13;
test2;&#13;
test3;&#13;
</code>
  <code><![CDATA[test1;
test2;
test3;
]]></code>
  <code>test1;
test2;
test3;
</code>
</root>
Friday, August 12, 2022
 