
I have this data in a LONGTEXT column (so the line breaks are retained):

Paragraph one
Paragraph two
Paragraph three
Paragraph four

I'm trying to match paragraph 1 through 3. I'm using this code:

preg_match('/Para(.*)three/', $row['file'], $m);

This returns nothing. If I try to work just within the first line of the paragraph, by matching:

preg_match('/Para(.*)one/', $row['file'], $m);

Then the code works and I get the proper string returned. What am I doing wrong here?

 Answers

5

Use s modifier.

preg_match('/Para(.*)three/s', $row['file'], $m);

http://php.net/manual/en/reference.pcre.pattern.modifiers.php
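
To see the effect of the s modifier in isolation, here is a minimal sketch. The $text variable stands in for $row['file'], and the pattern /one(.*)three/ is an assumption chosen so that the match has to cross line breaks; neither comes from the original question.

```php
<?php
// Hypothetical sample standing in for the LONGTEXT column value.
$text = "Paragraph one\nParagraph two\nParagraph three\nParagraph four";

// Without /s, "." does not match a newline, so the pattern cannot
// stretch from "one" on line 1 to "three" on line 3.
$withoutS = preg_match('/one(.*)three/', $text, $m1);

// With /s (PCRE_DOTALL), "." matches newlines as well.
$withS = preg_match('/one(.*)three/s', $text, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
```

With the s modifier, $m2[1] contains everything between "one" and "three", line breaks included.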

Saturday, November 5, 2022
2

This issue can be solved partially (if not completely) with a custom nl2br() function:

function nl2br_special($string){

    // Step 1: Add <br /> tags for each line-break
    $string = nl2br($string); 

    // Step 2: Remove the actual line-breaks
    $string = str_replace("\n", "", $string);
    $string = str_replace("\r", "", $string);

    // Step 3: Restore the line-breaks that are inside <pre></pre> tags
    // (the closing-tag slash is escaped because "/" is the regex delimiter)
    if(preg_match_all('/<pre>(.*?)<\/pre>/', $string, $match)){
        // $match[1] holds the captured contents of each <pre> block
        foreach($match[1] as $b){
            $string = str_replace('<pre>'.$b.'</pre>', "<pre>".str_replace("<br />", PHP_EOL, $b)."</pre>", $string);
        }
    }

    // Step 4: Removes extra <br /> tags

    // Before <pre> tags
    $string = str_replace("<br /><br /><br /><pre>", '<br /><br /><pre>', $string);
    // After </pre> tags
    $string = str_replace("</pre><br /><br />", '</pre><br />', $string);

    // Around <ul></ul> tags
    $string = str_replace("<br /><br /><ul>", '<br /><ul>', $string);
    $string = str_replace("</ul><br /><br />", '</ul><br />', $string);
    // Inside <ul> </ul> tags
    $string = str_replace("<ul><br />", '<ul>', $string);
    $string = str_replace("<br /></ul>", '</ul>', $string);

    // Around <ol></ol> tags
    $string = str_replace("<br /><br /><ol>", '<br /><ol>', $string);
    $string = str_replace("</ol><br /><br />", '</ol><br />', $string);
    // Inside <ol> </ol> tags
    $string = str_replace("<ol><br />", '<ol>', $string);
    $string = str_replace("<br /></ol>", '</ol>', $string);

    // Around <li></li> tags
    $string = str_replace("<br /><li>", '<li>', $string);
    $string = str_replace("</li><br />", '</li>', $string);

    return $string;
}

This must be applied to the content before it is HTML-Purified. Never re-process purified content unless you know what you're doing.

Please note that because each line-break and double line-breaks are already kept, you should not use the AutoFormat.AutoParagraph feature of HTML Purifier:

// Process line-breaks
$string = nl2br_special($string);

// Initiate HTML Purifier config
$purifier_config = HTMLPurifier_Config::createDefault();
$purifier_config->set('HTML.Allowed', 'p,ul,ol,li,strong,b,em,i,u,a[href],code,pre,blockquote,cite,img[src|alt],br,hr,h3,h4');
//$purifier_config->set('AutoFormat.AutoParagraph', true); // Make sure to NOT use this

// Initiate HTML Purifier
$purifier = new HTMLPurifier($purifier_config);

// Purify the content!
$string = $purifier->purify($string);

That's it!


Furthermore, because allowing basic HTML tags was originally intended to improve user experience by not adding another markup syntax, you might want to allow users to post code, and especially HTML code, which would not be interpreted/removed by HTML Purifier.

HTML Purifier currently allows users to post code, but it requires complex CDATA markers:

<![CDATA[
Place code here
]]>

Hard to remember and to write. To simplify the user experience as much as possible I believe it is best to allow users to add code by embedding it with simple <code> (for inline code) and <pre> (for blocks of code) tags. Here is how to do that:

function custom_code_tag_callback($code) {

    return '<code>'.trim(htmlspecialchars($code[1])).'</code>';
}
function custom_pre_tag_callback($code) {

    return '<pre><code>'.trim(htmlspecialchars($code[1])).'</code></pre>';
}

// Don't require HTMLPurifier's CDATA enclosing, instead allow simple <code> or <pre> tags
$string = preg_replace_callback('/<code>(.*?)<\/code>/is', 'custom_code_tag_callback', $string);
$string = preg_replace_callback('/<pre>(.*?)<\/pre>/is', 'custom_pre_tag_callback', $string);

Note that, like the nl2br processing, this must be done before the content is HTML Purified. Also, keep in mind that if users put <code> or <pre> tags in their own posted code, those tags will close the parent <code> or <pre> tag enclosing the code. This cannot be solved, and it also applies to the original CDATA markers or to any markup, even the one used on (for example using the ` symbol in a code sample will close the code tag).

Finally, for a great user experience there are other things we might want to automate, for example making links clickable. Luckily, this can be done with HTML Purifier's AutoFormat.Linkify feature.

Here is the final code that includes everything for an ultimate setup:

// === Declare functions ===

function nl2br_special($string){

    // Step 1: Add <br /> tags for each line-break
    $string = nl2br($string); 

    // Step 2: Remove the actual line-breaks
    $string = str_replace("\n", "", $string);
    $string = str_replace("\r", "", $string);

    // Step 3: Restore the line-breaks that are inside <pre></pre> tags
    // (the closing-tag slash is escaped because "/" is the regex delimiter)
    if(preg_match_all('/<pre>(.*?)<\/pre>/', $string, $match)){
        // $match[1] holds the captured contents of each <pre> block
        foreach($match[1] as $b){
            $string = str_replace('<pre>'.$b.'</pre>', "<pre>".str_replace("<br />", PHP_EOL, $b)."</pre>", $string);
        }
    }

    // Step 4: Removes extra <br /> tags

    // Before <pre> tags
    $string = str_replace("<br /><br /><br /><pre>", '<br /><br /><pre>', $string);
    // After </pre> tags
    $string = str_replace("</pre><br /><br />", '</pre><br />', $string);

    // Around <ul></ul> tags
    $string = str_replace("<br /><br /><ul>", '<br /><ul>', $string);
    $string = str_replace("</ul><br /><br />", '</ul><br />', $string);
    // Inside <ul> </ul> tags
    $string = str_replace("<ul><br />", '<ul>', $string);
    $string = str_replace("<br /></ul>", '</ul>', $string);

    // Around <ol></ol> tags
    $string = str_replace("<br /><br /><ol>", '<br /><ol>', $string);
    $string = str_replace("</ol><br /><br />", '</ol><br />', $string);
    // Inside <ol> </ol> tags
    $string = str_replace("<ol><br />", '<ol>', $string);
    $string = str_replace("<br /></ol>", '</ol>', $string);

    // Around <li></li> tags
    $string = str_replace("<br /><li>", '<li>', $string);
    $string = str_replace("</li><br />", '</li>', $string);

    return $string;
}


function custom_code_tag_callback($code) {

    return '<code>'.trim(htmlspecialchars($code[1])).'</code>';
}

function custom_pre_tag_callback($code) {

    return '<pre><code>'.trim(htmlspecialchars($code[1])).'</code></pre>';
}



// === Process user's input ===

// Process line-breaks
$string = nl2br_special($string);

// Allow simple <code> or <pre> tags for posting code
$string = preg_replace_callback('/<code>(.*?)<\/code>/is', 'custom_code_tag_callback', $string);
$string = preg_replace_callback('/<pre>(.*?)<\/pre>/is', 'custom_pre_tag_callback', $string);


// Initiate HTML Purifier config
$purifier_config = HTMLPurifier_Config::createDefault();
$purifier_config->set('HTML.Allowed', 'p,ul,ol,li,strong,b,em,i,u,a[href],code,pre,blockquote,cite,img[src|alt],br,hr,h3,h4');
$purifier_config->set('AutoFormat.Linkify', true); // Make links clickable
//$purifier_config->set('HTML.TargetBlank', true); // Uncomment if you want links to open new tabs
//$purifier_config->set('AutoFormat.AutoParagraph', true); // Leave this commented as it conflicts with nl2br


// Initiate HTML Purifier
$purifier = new HTMLPurifier($purifier_config);

// Purify the content!
$string = $purifier->purify($string);

Cheers!

Monday, November 21, 2022
 
tariq
 
4

You are missing the /ims flag at the end of your regex. Otherwise . will not match line breaks (as in your first paragraph). Actually /s would suffice, but I'm always using all three for simplicity.

Also, preg_match works for many simple cases. But if you are attempting more complex extractions, consider switching to phpQuery or QueryPath, which allow for:

foreach (qp($html)->find("p") as $p)  { print $p->text(); }

Friday, September 23, 2022
 
1

Use preg_quote to quote regular expression characters.

Like this:

preg_quote($theKeyword, '/');

Where '/' is the delimiter in your regular expression.
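
A minimal sketch of how the quoted keyword might be embedded in a pattern. The keyword and subject strings are made-up examples, not taken from the original question:

```php
<?php
// Hypothetical keyword containing regex metacharacters and the delimiter.
$theKeyword = 'a/b (c+d)?';
$subject    = 'Search for a/b (c+d)? in this text';

// preg_quote() escapes the metacharacters; passing '/' as the second
// argument also escapes the delimiter used around the pattern.
$pattern = '/' . preg_quote($theKeyword, '/') . '/';

$found = preg_match($pattern, $subject, $m);

echo $pattern, "\n"; // /a\/b \(c\+d\)\?/
echo $m[0], "\n";    // a/b (c+d)?
```

Without preg_quote, the "/" in the keyword would end the pattern early and the "(", "+", ")" and "?" would be read as regex syntax instead of literal text.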

Friday, November 25, 2022
 
3
preg_match('~"http://(.*)"~iU', $code, $matches);

Your issue was that the pattern needs delimiters (I chose ~). See the preg_match() man page for more information.
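
As a rough illustration of that pattern in use, assuming a made-up $code string (the example.com URL is a placeholder):

```php
<?php
// Hypothetical input; the URL is only a placeholder.
$code = '<a href="http://example.com/page" title="x">link</a>';

// "~" is the delimiter, so the slashes in "http://" need no escaping.
// "i" makes the match case-insensitive, "U" makes quantifiers ungreedy,
// so (.*) stops at the first closing quote instead of the last one.
$found = preg_match('~"http://(.*)"~iU', $code, $matches);

echo $matches[1], "\n"; // example.com/page
```

Choosing a delimiter that does not occur in the pattern, such as ~ for URLs, avoids having to escape every slash.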

Saturday, December 10, 2022
 
5
  • \n is a Linux/Unix line break.
  • \r is a classic Mac OS (non-OS X) line break. Mac OS X uses the above Unix \n.
  • \r\n is a Windows line break.

I usually just use \n on our Linux systems, and most Windows apps deal with it OK anyway.
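
If the input may come from any of these systems, one possible approach is to normalize everything to the Unix line break before processing. This is only a sketch; the $mixed string is invented for illustration:

```php
<?php
// Hypothetical input mixing Windows, classic Mac OS, and Unix line breaks.
$mixed = "one\r\ntwo\rthree\nfour";

// Replace "\r\n" first so the pair is not turned into two line breaks,
// then convert any remaining lone "\r".
$normalized = str_replace(["\r\n", "\r"], "\n", $mixed);

$lines = explode("\n", $normalized);

echo count($lines), "\n"; // 4
```

The order of the search array matters: str_replace processes the entries in sequence, so handling "\r" before "\r\n" would leave a stray "\n" after each Windows line break.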

Friday, November 25, 2022
 