I have random text stored in $sentences. Using regex, I want to split the text into sentences, see:

function splitSentences($text) {
    $re = '/                # Split sentences on whitespace between them.
        (?<=                # Begin positive lookbehind.
          [.!?]             # Either an end of sentence punct,
        | [.!?][\'"]        # or end of sentence punct and quote.
        )                   # End positive lookbehind.
        (?<!                # Begin negative lookbehind.
          Mr\.              # Skip either "Mr."
        | Mrs\.             # or "Mrs.",
        | T\.V\.A\.         # or "T.V.A.",
                            # or... (you get the idea).
        )                   # End negative lookbehind.
        \s+                 # Split on whitespace between sentences.
        /ix';

    $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
    return $sentences;
}

$sentences = splitSentences($sentences);

print_r($sentences);

It works fine.

However, it doesn't split into sentences if there are unicode characters:

$sentences = 'Entertainment media properties. Fairy Tail and Tokyo Ghoul.';

Or this scenario:

$sentences = "Entertainment media properties.&Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.";

What can I do to make it work when unicode characters exist in the text?

Here is an ideone for testing.

Bounty info

I am looking for a complete solution to this. Before posting an answer, please read the comment thread I had with Wiktor Stribiżew for more relevant info on this issue.

 Answers

5

As should be expected, natural language processing of any sort is not a trivial task. The reason is that natural languages are evolutionary systems: no single person sat down and decided which ideas were good and which were not, and every rule has 20-40% exceptions. That said, a single regex that does what you want would be off the charts in complexity. Still, the following solution relies mainly on regexes.


  • The idea is to go over the text gradually.
  • At any given time, the current chunk of text is held in two parts: one is the candidate substring before a sentence boundary, the other is the substring after it.
  • The first 10 regex pairs detect positions that look like sentence boundaries but actually aren't. In that case, before and after are advanced without registering a new sentence.
  • If none of these pairs matches, matching is attempted with the last 3 pairs, possibly detecting a boundary.

As for where these regexes came from: I translated this Ruby library, which is generated based on this paper. If you truly want to understand them, there is no alternative but to read the paper.

As far as accuracy goes - I encourage you to test it with different texts. After some experimentation, I was very pleasantly surprised.

In terms of performance, the regexes should be highly performant: all of them have either a \A or a \Z anchor, there are almost no repetition quantifiers, and where there are, there can't be any backtracking. Still, regexes are regexes. You will have to do some benchmarking if you plan to use this in tight loops on huge chunks of text.


Mandatory disclaimer: excuse my rusty PHP skills. The following code might not be the most idiomatic PHP ever, but it should still be clear enough to get the point across.


function sentence_split($text) {
    $before_regexes = array('/(?:(?:[\'"„][.!?…][\'"”]\s)|(?:[^.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[(]*\.\.\.[\])]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[.!?…]+\p{Pe} )|(?:[\[(]*…[\])]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:["”\']\s*))\Z/su',
        '/(?:(?:[.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[.!?…][\'"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[.!?…]\s))\Z/su');
    $after_regexes = array('/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su');
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = 13;

    $sentences = array();
    $sentence = '';
    $before = '';
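    // Slice by characters, not bytes: with the /u modifier preg_match() fails
    // on a subject that contains a partially cut multibyte UTF-8 sequence.
    // Requires the mbstring extension.
    $after = mb_substr($text, 0, 10, 'UTF-8');
    $text = mb_substr($text, 10, null, 'UTF-8');

    while($text != '') {
        for($i = 0; $i < $count; $i++) {
            if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }

        // Move the window forward by one character.
        $first_from_text = mb_substr($text, 0, 1, 'UTF-8');
        $text = mb_substr($text, 1, null, 'UTF-8');
        $first_from_after = mb_substr($after, 0, 1, 'UTF-8');
        $after = mb_substr($after, 1, null, 'UTF-8');
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;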
    }

    if($sentence != '' && $after != '') {
        array_push($sentences, $sentence.$after);
    }

    return $sentences;
}

$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));
Thursday, August 11, 2022
3

I finally found a working solution thanks to Tibor's answer here: Regex to ignore accents? PHP

My function highlights text ignoring diacritics, spaces, apostrophes and dashes:

  function highlight($pattern, $string)
  {
    $array = str_split($pattern);

    //add or remove characters to be ignored
    $pattern=implode('[\s\'-]*', $array);  

    //list of letters with diacritics
    $replacements = Array("a" => "[áa]", "e"=>"[ée]", "i"=>"[íi]", "o"=>"[óo]", "u"=>"[úu]", "A" => "[ÁA]", "E"=>"[ÉE]", "I"=>"[ÍI]", "O"=>"[ÓO]", "U"=>"[ÚU]");

    $pattern=str_replace(array_keys($replacements), $replacements, $pattern);  

    //instead of <u> you can use <b>, <i> or even <div> or <span> with css class
    return preg_replace("/(" . $pattern . ")/ui", "<u>\\1</u>", $string);
  }
Tuesday, August 23, 2022
 
1

Parsing sentences is far from being a trivial task, even for languages written in the Latin script, such as English. A naive approach like the one you outline in your question will fail often enough that it will prove useless in practice.

A better approach is to use a BreakIterator configured with the right Locale.

import java.text.BreakIterator;
import java.util.Locale;

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
for (int end = iterator.next();
    end != BreakIterator.DONE;
    start = end, end = iterator.next()) {
  System.out.println(source.substring(start,end));
}

Yields the following result:

  1. This is a test.
  2. This is a T.L.A. test.
  3. Now with a Dr. in it.
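
For the PHP question at the top of this thread, the same ICU facility is available through the intl extension as IntlBreakIterator. The following is a minimal sketch, assuming intl is installed and the input is valid UTF-8; the helper name split_sentences_icu and the 'en_US' default are illustrative choices, not part of the answer above.

```php
<?php
// Sketch only: requires the intl extension (ICU). The helper name and the
// 'en_US' default locale are assumptions made for this example.
function split_sentences_icu(string $text, string $locale = 'en_US'): array
{
    $iterator = IntlBreakIterator::createSentenceInstance($locale);
    $iterator->setText($text);

    $sentences = [];
    // getPartsIterator() yields the text between consecutive boundaries.
    foreach ($iterator->getPartsIterator() as $part) {
        // Strip ASCII and Unicode whitespace (including U+00A0) at both ends.
        $part = preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $part);
        if ($part !== '') {
            $sentences[] = $part;
        }
    }
    return $sentences;
}

print_r(split_sentences_icu("Entertainment media properties.\u{00A0} Fairy Tail and Tokyo Ghoul."));
```

How abbreviations such as "Dr." are treated depends on the ICU rules and locale data shipped with your PHP build, so it is worth checking the output against your own text.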
Saturday, December 3, 2022
5

A list of Unicode properties can be found in http://www.unicode.org/Public/UNIDATA/PropList.txt.

The properties for each character can be found in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt (1.2 MB).

In your case,

  • + (PLUS SIGN) is Sm,
  • - (HYPHEN-MINUS) is Pd,
  • * (ASTERISK) is Po,
  • / (SOLIDUS) is also Po, and
  • ^ (CIRCUMFLEX ACCENT) is Sk.

You're better off matching them with [-+*/^].
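
A short PHP sketch of why the explicit class is the safer choice, assuming PCRE with Unicode property support (the u modifier). Each operator does match its general category, but a class built from those categories also accepts unrelated characters such as ! (which is Po). The variable names are illustrative.

```php
<?php
// General category of each operator, as listed above.
$categories = ['+' => 'Sm', '-' => 'Pd', '*' => 'Po', '/' => 'Po', '^' => 'Sk'];

foreach ($categories as $char => $category) {
    // Match the single character against its Unicode category...
    $byCategory = preg_match('~^\p{' . $category . '}$~u', $char) === 1;
    // ...and against the explicit character class.
    $byClass = preg_match('~^[-+*/^]$~', $char) === 1;
    printf("%s  category %s: %s  explicit class: %s\n",
        $char, $category, var_export($byCategory, true), var_export($byClass, true));
}

// A category-based class is broader than intended: "!" is also Po.
$tooBroad = preg_match('~^[\p{Sm}\p{Pd}\p{Po}\p{Sk}]$~u', '!') === 1;
// The explicit class rejects it.
$exact = preg_match('~^[-+*/^]$~', '!') === 1;

var_dump($tooBroad, $exact);
```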

Tuesday, November 22, 2022
 
mrbyte
 
1

The solution is to match and capture the abbreviations and build the replacement using a callback:

var re = /\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g; 
var str = 'This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn\'t split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.';
var result = str.replace(re, function(m, g1, g2){
  return g1 ? g1 : g2+"\r";
});
var arr = result.split("\r");
document.body.innerHTML = "<pre>" + JSON.stringify(arr, 0, 4) + "</pre>";

Regex explanation:

  • \b(\w\.\w\.) - match and capture into Group 1 the abbreviation (consisting of a word character, then . and again a word character and a .) as a whole word
  • | - or...
  • ([.?!])\s+(?=[A-Za-z]):
    • ([.?!]) - match and capture into Group 2 either . or ? or !
    • \s+ - match 1 or more whitespace symbols...
    • (?=[A-Za-z]) - that are before an ASCII letter.
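
The same match-or-split idea can be carried over to PHP with preg_replace_callback. This is a sketch under the same assumptions as the JavaScript above (only two-letter abbreviations are protected, and a split needs an ASCII letter after the whitespace); the function name split_keeping_abbreviations is hypothetical.

```php
<?php
// Sketch only: a PHP port of the JavaScript approach above.
// Assumes the input contains no "\r", which is used as a temporary marker.
function split_keeping_abbreviations(string $text): array
{
    $marked = preg_replace_callback(
        '/\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/',
        function (array $m): string {
            // Group 1 matched: an abbreviation such as "e.g.", keep it as is.
            if ($m[1] !== '') {
                return $m[1];
            }
            // Group 2 matched: end-of-sentence punctuation, drop the
            // whitespace after it and leave a split marker instead.
            return $m[2] . "\r";
        },
        $text
    );

    return explode("\r", $marked);
}

print_r(split_keeping_abbreviations('It costs 100.000 e.g. today. Next one.'));
```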
Thursday, November 17, 2022
 