
My situation: a function returns a string that looks something like this:

<p>This is some text and here is a <strong>bold text then the post stop here....</p>

The function returns a teaser (summary) of the text, so it cuts the text off after a certain number of words. In this case that leaves the strong tag unclosed, while the whole string is still wrapped in a paragraph.

Is it possible to convert the above result/output to the following:

<p>This is some text and here is a <strong>bold text then the post stop here....</strong></p>

I do not know where to begin. I found a function on the web that does this with a regex, but it puts the closing tag after the end of the string, so the result won't validate: I want all opening and closing tags to sit inside the paragraph tags. The function I found produces this, which is also wrong:

<p>This is some text and here is a <strong>bold text then the post stop here....</p></strong>

Note that the tag can be strong, italic, anything, which is why I cannot simply append a specific closing tag by hand in the function. Is there a pattern that can do this for me?

 Answers

5

Here is a function I've used before, which works pretty well:

function closetags($html) {
    preg_match_all('#<(?!meta|img|br|hr|input\b)\b([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
    $openedtags = $result[1];
    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    $openedtags = array_reverse($openedtags);
    for ($i=0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)) {
            $html .= '</'.$openedtags[$i].'>';
        } else {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
} 

Personally though, I would not do it using regexp but a library such as Tidy. This would be something like the following:

$str = '<p>This is some text and here is a <strong>bold text then the post stop here....</p>';
$tidy = new Tidy();
$clean = $tidy->repairString($str, array(
    'output-xml' => true,
    'input-xml' => true
));
echo $clean;
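
If pulling in Tidy is not an option, a small stack-based scanner can put each missing closing tag where it belongs, before the closing tag of its parent, rather than at the end of the string. The following is a minimal sketch under stated assumptions, not a full HTML parser: closeTagsNested is a hypothetical name, tags are assumed to contain no literal > inside attribute values, comments and script content are not handled, and a closing tag with no matching opener is simply dropped.

```php
<?php
// Hypothetical helper (not from the original answer): closes unclosed tags
// in place so that the nesting stays valid.
function closeTagsNested(string $html): string
{
    // Void elements never get a closing tag.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];
    $stack = [];  // names of the currently open tags, innermost last
    $out = '';
    $pos = 0;

    // Find every opening and closing tag together with its byte offset.
    preg_match_all('#<(/?)([a-z][a-z0-9]*)\b[^>]*>#i', $html, $matches,
                   PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    foreach ($matches as $tag) {
        [$full, $offset] = $tag[0];
        $isClose = $tag[1][0] === '/';
        $name = strtolower($tag[2][0]);

        // Copy the text between the previous tag and this one.
        $out .= substr($html, $pos, $offset - $pos);
        $pos = $offset + strlen($full);

        if (!$isClose) {
            // Void elements and self-closing syntax (<br/>) are not tracked.
            if (!in_array($name, $void, true) && substr($full, -2) !== '/>') {
                $stack[] = $name;
            }
            $out .= $full;
        } elseif (in_array($name, $stack, true)) {
            // Close every tag opened after the matching one, innermost first.
            while (($top = array_pop($stack)) !== $name) {
                $out .= "</$top>";
            }
            $out .= $full;
        }
        // Otherwise: a closing tag without an opener is dropped.
    }

    $out .= substr($html, $pos);
    // Close whatever is still open at the end of the string.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}

echo closeTagsNested('<p>This is some text and here is a <strong>bold text then the post stop here....</p>');
// <p>This is some text and here is a <strong>bold text then the post stop here....</strong></p>
```

Because the scanner keeps a stack of open tags, it works for strong, em, or any other non-void element without listing them in advance.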
Sunday, July 31, 2022
2

You can split each entry on - and build the nested array level by level as needed. Notice the use of & in the code: $current is a reference into the result array.

Example:

$str = "15-02-01-0000,15-02-02-0000,15-02-03-0000,15-02-04-0000,15-02-05-0000,15-02-10-0000,15-02-10-9100,15-02-10-9101,15-15-81-0000,15-15-81-0024";
$arr = explode(",", $str);
$res = [];
foreach($arr as $e) { // for each entry in your data
    $a = explode("-", $e); // split the entry into its segments
    $current = &$res; // start at the root of the result
    while(count($a) > 1) { // walk down, creating nested arrays where needed
        $key = array_shift($a); // take the first key
        if (!isset($current[$key])) // if the path does not exist yet, create an empty array
            $current[$key] = array();
        $current = &$current[$key];
    }
    $current[] = $e; // found the right path so add the element
}
unset($current); // drop the reference so later writes cannot touch $res

The full result will be in $res.
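
To make the shape of the result concrete, here is a small sketch that wraps the same loop in a function and runs it on a two-entry input. The function name groupByPrefix is hypothetical and only used for illustration. Note that the last segment of each entry is not used as a key: the full entry is appended under its first three segments.

```php
<?php
// Hypothetical wrapper around the loop above, for illustration only.
function groupByPrefix(string $str): array
{
    $res = [];
    foreach (explode(",", $str) as $e) {
        $a = explode("-", $e);          // segments of this entry
        $current = &$res;               // start at the root
        while (count($a) > 1) {         // all segments except the last are keys
            $key = array_shift($a);
            if (!isset($current[$key])) {
                $current[$key] = [];
            }
            $current = &$current[$key]; // descend one level
        }
        $current[] = $e;                // store the full entry at its path
        unset($current);                // break the reference before the next entry
    }
    return $res;
}

echo json_encode(groupByPrefix("15-02-01-0000,15-02-10-9100")), "\n";
// {"15":{"02":{"01":["15-02-01-0000"],"10":["15-02-10-9100"]}}}
```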

Monday, October 10, 2022
5

Correct me if I'm wrong, but I don't think you can do this with a simple regexp. In a regexp implementation that supports variable-length lookbehind you could use something like this:

$parts = preg_split("/(?<!<[^>]*)\./", $input);

but PHP does not allow non-fixed-length lookbehind, so that won't work. Apparently the only 2 that do are JGsoft and the .NET regexp. Useful Page
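
One possible workaround that stays within PCRE is the pair of backtracking verbs (*SKIP)(*FAIL): the pattern first consumes a whole tag and then rejects it, so the engine steps over the tag and only periods outside tags act as split points. This is a minimal sketch rather than the method used in this answer; splitOutsideTags is a hypothetical name, and the sketch assumes tags contain no literal > inside attribute values.

```php
<?php
// Hypothetical helper: split on periods, but never on a period inside a tag.
function splitOutsideTags(string $input): array
{
    // First alternative: match a whole tag, then (*SKIP)(*FAIL) discards the
    // match and resumes scanning after the tag.
    // Second alternative: a literal period, which is the actual delimiter.
    return preg_split('/<[^>]*>(*SKIP)(*FAIL)|\./', $input);
}

print_r(splitOutsideTags('one. two <a href="a.b.c">tag. three</a>. four'));
```

For this input the pieces are "one", " two <a href="a.b.c">tag", " three</a>" and " four"; the periods inside the href value are left alone. It does not merge pieces up to a maximum length, which still needs a loop.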

My method of dealing with this would be:

function splitStringUp($input, $maxlen) {
    $parts = explode(".", $input);
    $i = 0;
    while ($i < count($parts)) {
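        // Part ends inside an unclosed tag: glue it to the next part, if there is one.
        if ($i < count($parts) - 1 && preg_match("/<[^>]*$/", $parts[$i])) {
            array_splice($parts, $i, 2, $parts[$i] . "." . $parts[$i+1]);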
        } else {
            if ($i < (count($parts) - 1) && strlen($parts[$i] . "." . $parts[$i+1]) < $maxlen) {
                array_splice($parts, $i, 2, $parts[$i] . "." . $parts[$i+1]);
            } else {
                $i++;
            }
        }
    }
    return $parts;
}

You didn't mention what you want to happen when an individual sentence is >8000 chars long, so this just leaves them intact.

Sample output:

splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 8000);
array(1) {
  [0]=> string(114) "this is a sentence. this is another sentence. this is an html <a href="a.b.c">tag. and the closing tag</a>. hooray"
}

splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 80);
array(2) {
  [0]=> string(81) "this is a sentence. this is another sentence. this is an html <a href="a.b.c">tag"
  [1]=> string(32) " and the closing tag</a>. hooray"
}

splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 40);
array(4) {
  [0]=> string(18) "this is a sentence"
  [1]=> string(25) " this is another sentence"
  [2]=> string(36) " this is an html <a href="a.b.c">tag"
  [3]=> string(32) " and the closing tag</a>. hooray"
}

splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 0);
array(5) {
  [0]=> string(18) "this is a sentence"
  [1]=> string(25) " this is another sentence"
  [2]=> string(36) " this is an html <a href="a.b.c">tag"
  [3]=> string(24) " and the closing tag</a>"
  [4]=> string(7) " hooray"
}
Thursday, December 15, 2022
 
dvlsg
 
4

There is a UDF that will do that, described here:

User Defined Function to Strip HTML

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX) AS
BEGIN
    DECLARE @Start INT
    DECLARE @End INT
    DECLARE @Length INT
    SET @Start = CHARINDEX('<',@HTMLText)
    SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
    SET @Length = (@End - @Start) + 1
    WHILE @Start > 0 AND @End > 0 AND @Length > 0
    BEGIN
        SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
        SET @Start = CHARINDEX('<',@HTMLText)
        SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
        SET @Length = (@End - @Start) + 1
    END
    RETURN LTRIM(RTRIM(@HTMLText))
END
GO

Edit: note this is for SQL Server 2005, but if you change the keyword MAX to something like 4000, it will work in SQL Server 2000 as well.

Sunday, September 11, 2022
 
5

You can use a simple regex like this:

public static string StripHTML(string input)
{
   return Regex.Replace(input, "<.*?>", String.Empty);
}

Be aware that this solution has its own flaw. See Remove HTML tags in String for more information (especially the comments of @mehaase)
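
To illustrate the kind of input where a pattern like <.*?> goes wrong, here is the same regex run through PHP's preg_replace (PHP rather than C#, to match the other snippets on this page). The helper name stripTagsNaive and the inputs are made up for illustration and are not necessarily the cases raised in the linked comments.

```php
<?php
// Hypothetical helper: the same "<.*?>" pattern as above, via preg_replace.
function stripTagsNaive(string $input): string
{
    return preg_replace('/<.*?>/', '', $input);
}

// Well-formed markup is handled as expected.
echo stripTagsNaive('<p>Hello <b>world</b></p>'), "\n";      // Hello world

// A ">" inside an attribute value ends the match too early.
echo stripTagsNaive('<img alt="a > b" src="x">text'), "\n";  //  b" src="x">text

// Plain text with comparison operators is mistaken for a tag.
echo stripTagsNaive('if a < b and c > d'), "\n";             // if a  d
```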

Another solution would be to use the HTML Agility Pack.
You can find an example using the library here: HTML agility pack - removing unwanted tags without removing content?

Wednesday, November 16, 2022