
In RegEx, I want to find the tag and everything between two XML tags, like the following:

    <addressLine>280 Flinders Mall</addressLine>

I want to find the tag and everything between primaryAddress, and erase that.

The content between the primaryAddress tags varies, but I want to remove the entire element, including its sub-tags, whenever I encounter primaryAddress.

Anyone have any idea how to do that?



It is not a good idea to use regex for HTML/XML parsing...

However, if you want to do it anyway, search for regex pattern


and replace it with empty string...
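The pattern itself is not shown above, so here is a minimal sketch of what such a replacement could look like in Python. The pattern, the sample document, and the helper name `strip_primary_address` are assumptions for illustration, not the answer's original text. It only holds up when `primaryAddress` elements are not nested inside each other and the tag text does not appear in comments or CDATA sections.

```python
import re

# Hypothetical sample document; every element name except primaryAddress
# and addressLine is invented for this example.
xml = (
    "<customer>"
    "<name>Jane</name>"
    "<primaryAddress>\n"
    "  <addressLine>280 Flinders Mall</addressLine>\n"
    "</primaryAddress>"
    "<phone>555-0100</phone>"
    "</customer>"
)

# [\s\S]*? matches any character, newlines included, as few times as
# possible, so the match ends at the first closing tag it meets.
PATTERN = re.compile(r"<primaryAddress\b[^>]*>[\s\S]*?</primaryAddress>")


def strip_primary_address(text: str) -> str:
    """Remove every primaryAddress element together with its contents."""
    return PATTERN.sub("", text)


cleaned = strip_primary_address(xml)
print(cleaned)
```

The lazy quantifier matters here: a greedy `[\s\S]*` would swallow everything from the first opening tag to the last closing tag in the document.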

Sunday, December 11, 2022

This gets the whole table, but it can be modified to grab another tag. It is a quite case-specific solution that can only be used under specific circumstances: it breaks if HTML, PHP or CSS comments contain the opening or closing tag. Use it with caution.


// **********************************************************************************
// Gets a whole html tag with its contents.
//  - Source should be a well formatted html string (get it with file_get_contents or cURL)
//  - You CAN provide a custom startTag with in it e.g. an id or something else (<table style='border:0;')
//    This is recommended if it is not the only p/table/h2/etc. tag in the script.
//  - Ignores closing tags if there is an opening tag of the same sort you provided. Got it?
function getTagWithContents($source, $tag, $customStartTag = false)
{
    $startTag = '<'.$tag;
    $endTag   = '</'.$tag.'>';

    $startTagLength = strlen($startTag);
    $endTagLength   = strlen($endTag);

//      ***************************** 
    if ($customStartTag)
        $gotStartTag = strpos($source, $customStartTag);
    else
        $gotStartTag = strpos($source, $startTag);

    // Can't find it? (strpos returns false on failure; 0 is a valid position)
    if ($gotStartTag === false)
        return false;       
    else
    {

//      ***************************** 

        // This is the hard part: finding the correct closing tag position.
        // <table class="schedule">
        //     <table>
        //     </table> <-- Not this one
        // </table> <-- But this one

        $foundIt          = false;
        $locationInScript = $gotStartTag;
        $startPosition    = $gotStartTag;

        // Checks if there is another opening tag before the closing tag.
        while ($foundIt == false)
        {
            $gotAnotherStart = strpos($source, $startTag, $locationInScript + $startTagLength);
            $endPosition     = strpos($source, $endTag,   $locationInScript + $endTagLength);

            // If it can find another opening tag before the closing tag, skip that closing tag.
            if ($gotAnotherStart && $gotAnotherStart < $endPosition)
            {
                $locationInScript = $endPosition;
            }
            else
            {
                $foundIt  = true;
                $endPosition = $endPosition + $endTagLength;
            }
        }

//      ***************************** 

        // cut the piece from its source and return it.
        return substr($source, $startPosition, ($endPosition - $startPosition));
    }
}
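For comparison, the same "find the matching closing tag" idea can be sketched with an explicit depth counter. This is a different technique from the skip-ahead loop above, written in Python, and not the author's method. The function name `get_tag_with_contents` and the sample markup are assumptions. Like the original, it treats the input as plain text, so tags inside comments or scripts will confuse it, and a self-closing tag of the same name would be counted as an opening tag.

```python
import re


def get_tag_with_contents(source, tag, custom_start_tag=None):
    """Return the first <tag ...>...</tag> block, nested same-name tags
    included, or None when no balanced block is found."""
    start = source.find(custom_start_tag if custom_start_tag else "<" + tag)
    if start == -1:
        return None

    # Matches an opening or a closing tag with this name; group 1 is "/"
    # for a closing tag and empty for an opening tag.
    token = re.compile(r"<(/?)" + re.escape(tag) + r"\b[^>]*>", re.IGNORECASE)

    depth = 0
    for match in token.finditer(source, start):
        if match.group(1):
            depth -= 1
            if depth == 0:
                # This closing tag balances the first opening tag.
                return source[start:match.end()]
        else:
            depth += 1
    return None


# Hypothetical sample markup with one nested table.
html = (
    '<div><table class="schedule"><tr><td>'
    "<table><tr><td>x</td></tr></table>"
    "</td></tr></table><table>other</table></div>"
)
result = get_tag_with_contents(html, "table", '<table class="schedule"')
print(result)
```

Counting depth handles any level of nesting in a single pass, at the cost of needing a regular expression to recognise both tag forms.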


Application of the function:

$gotTable = getTagWithContents($tableData, 'table', '<table class="schedule"');
if (!$gotTable)
{
    $error = 'Failed to log in or to get the tag';
}
else
{
    //Do something you want to do with it, e.g. display it or clean it...
    $cleanTable = preg_replace('|href=\'(.*)\'|', '', $gotTable);
    $cleanTable = preg_replace('|TITLE="(.*)"|', '', $cleanTable);
}

Above is my final solution to my problem. Below is the old solution, from which I made the function for universal use.

Old solution:

// Try to find the table and remember its starting position. Check for success.
// No success means the user is not logged in.
$gotTableStart = strpos($source, '<table class="schedule"');
if ($gotTableStart === false)
{
    $err = 'Can\'t find the table start';
}
else
{
//      ***************************** 
    // This is the hard part: finding the closing tag.
    $foundIt          = false;
    $locationInScript = $gotTableStart;
    $tableStart       = $gotTableStart;

    while ($foundIt == false)
    {
        $innerTablePos = strpos($source, '<table', $locationInScript + 6);
        $tableEnd      = strpos($source, '</table>', $locationInScript + 7);

        // If it can find '<table' before '</table>' skip that closing tag.
        if ($innerTablePos != false && $innerTablePos < $tableEnd)
        {
            $locationInScript = $tableEnd;
        }
        else
        {
            $foundIt  = true;
            $tableEnd = $tableEnd + 8;
        }
    }

//      ***************************** 

    // Clear the table from links and popups...
    $rawTable   = substr($source, $tableStart, ($tableEnd - $tableStart));
}

Wednesday, October 26, 2022

For this PHP regex:

$str = preg_replace ( '{(.)\1+}', '$1', $str );
$str = preg_replace ( '{[ \'-_\(\)]}', '', $str );

In Java:

str = str.replaceAll("(.)\\1+", "$1");
str = str.replaceAll("[ '-_\\(\\)]", "");

I suggest you provide your input and expected output; then you will get better answers on how it can be done in PHP and/or Java.
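To see what the two substitutions do, here is a rough Python sketch of the same patterns. The function names and sample strings are made up for illustration. One thing worth checking against your expected output: inside a character class, `'-_` is a range running from `'` to `_`, which also covers digits and uppercase letters. If only the literal characters were intended (an assumption, since the intent is not stated), the hyphen can be moved to the end of the class.

```python
import re


def collapse_repeats(text: str) -> str:
    # (.)\1+ matches a character followed by one or more copies of itself;
    # the whole run is replaced by the single captured character.
    return re.sub(r"(.)\1+", r"\1", text)


def strip_range(text: str) -> str:
    # Same class as in the answer: '-_ is a RANGE from ' (0x27) to _ (0x5F),
    # so digits and uppercase letters are removed as well.
    return re.sub(r"[ '-_()]", "", text)


def strip_literals(text: str) -> str:
    # Assumed alternative: with the hyphen last it is a literal character,
    # so only space, ', _, (, ) and - are removed.
    return re.sub(r"[ '_()-]", "", text)


print(collapse_repeats("aaabbbccd"))        # abcd
print(strip_range("Abc 123"))               # bc
print(strip_literals("it's a (test)_x-y"))  # itsatestxy
```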

Sunday, October 9, 2022


NSString *xml = @"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><badgeCount>6</badgeCount><rank>2</rank><screenName>myName</screenName>";
NSString *pattern = @"<badgeCount>(\\d+)</badgeCount>";

NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern
                                                                       options:0
                                                                         error:nil];
NSTextCheckingResult *textCheckingResult = [regex firstMatchInString:xml options:0 range:NSMakeRange(0, xml.length)];

NSRange matchRange = [textCheckingResult rangeAtIndex:1];
NSString *match = [xml substringWithRange:matchRange];
NSLog(@"Found string '%@'", match);

NSLog output:

Found string '6'
Friday, September 9, 2022

Once you've extracted the embedded XML document, you should use a proper XML parser.

use feature qw( say );
use XML::LibXML qw( );

my $xml_doc = XML::LibXML->new->parse_string($xml);

for my $key_node ($xml_doc->findnodes("/localconfig/key")) {
   my $key = $key_node->getAttribute("name");
   my $val = $key_node->findvalue("value/text()");
   say "$key: $val";
}

So that leaves us with the question of how to extract the XML document.

Option 1: XML::LibXML

You could use XML::LibXML and simply tell it to ignore the error (the spurious </p> tag).

my $html_doc = XML::LibXML->new( recover => 2 )->parse_html_fh($html);
my $xml = encode_utf8( $html_doc->findvalue('/html/body/pre/text()') =~ s/^[^<]*//r );

Option 2: Regex Match

You could probably get away with using a regex pattern match.

use HTML::Entities qw( decode_entities );

my $xml = decode_entities( ( $html =~ m{<pre>[^&]*(.*?)</pre>}s )[0] );
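The regex route can be sketched end to end in Python as well: pull the entity-encoded text out of the `<pre>` element, decode the entities, then hand the result to a real XML parser. The sample page, the key name `example_key`, and its value are invented for this sketch. The `/localconfig/key` layout with a `name` attribute and a `value` child follows the Perl snippet above.

```python
import html
import re
import xml.etree.ElementTree as ET

# Hypothetical page: an XML document, entity-encoded, inside <pre>, preceded
# by some banner text and followed by a spurious </p>.
page = (
    "<html><body><p><pre>Some banner text\n"
    "&lt;localconfig&gt;\n"
    "  &lt;key name=&quot;example_key&quot;&gt;"
    "&lt;value&gt;example value&lt;/value&gt;&lt;/key&gt;\n"
    "&lt;/localconfig&gt;\n"
    "</pre></p></body></html>"
)

# [^&]* skips everything up to the first entity (the encoded "<" that opens
# the XML document); (.*?) then captures lazily up to the closing </pre>.
match = re.search(r"<pre>[^&]*(.*?)</pre>", page, re.DOTALL)
xml_text = html.unescape(match.group(1))

# From here on a proper XML parser does the work.
root = ET.fromstring(xml_text)
config = {key.get("name"): key.findtext("value") for key in root.findall("key")}
print(config)
```

The regex is only used to locate the embedded document. All structural work on the XML itself is left to the parser.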

Option 3: Mojo::DOM

You could use Mojo::DOM to extract the embedded XML document.

use Encode    qw( decode encode_utf8 );
use Mojo::DOM qw( );

my $decoded_html = decode($encoding, $html);
my $html_doc = Mojo::DOM->new($decoded_html);    
my $xml = encode_utf8( $html_doc->at('html > body > pre')->text =~ s/^[^<]*//r );

The problem with Mojo::DOM is that you need to know the encoding of the document before you pass it to the parser (because you must pass it decoded), but you need to parse the document in order to extract its encoding from the document itself.

(Of course, you could use Mojo::DOM to parse the XML too.)

Note that the HTML fragment <p><pre></pre></p> means <p></p><pre></pre>, and both XML::LibXML and Mojo::DOM handle this correctly.

Saturday, August 6, 2022