Viewed   130 times

I need to replace Microsoft Word's version of single and double quotations marks (“ ” ‘ ’) with regular quotes (' and ") due to an encoding issue in my application. I do not need them to be HTML entities and I cannot change my database schema.

I have two options: to use either a regular expression or an associated array.

Is there a better way to do this?

 Answers

1

Considering you only want to replace a few specific and well identified characters, I would go for str_replace with an array: you obviously don't need the heavy artillery regex will bring you ;-)

And if you encounter some other special characters (damn copy-paste from Microsoft Word...), you can just add them to that array whenever is necessary / whenever they are identified.


The best answer I can give to your comment is probably this link: Convert Smart Quotes with PHP

And the associated code (quoting that page):

function convert_smart_quotes($string) 
{ 
    $search = array(chr(145), 
                    chr(146), 
                    chr(147), 
                    chr(148), 
                    chr(151)); 

    $replace = array("'", 
                     "'", 
                     '"', 
                     '"', 
                     '-'); 

    return str_replace($search, $replace, $string); 
} 

(I don't have Microsoft Word on this computer, so I can't test by myself)

I don't remember exactly what we used at work (I was not the one having to deal with that kind of input), but it was the same kind of stuff...

Tuesday, December 20, 2022
2

Use

$mail->CharSet = 'UTF-8';

instead of

$mail->CharSet = "utf8";

Monday, September 12, 2022
 
vino
 
1

I think this is a little cleaner and avoids reference bugs:

function unMagicQuotify($ar) {
  $fixed = array();
  foreach ($ar as $key=>$val) {
    if (is_array($val)) {
      $fixed[stripslashes($key)] = unMagicQuotify($val);
    } else {
      $fixed[stripslashes($key)] = stripslashes($val);
    }
  }
  return $fixed;
}

$process = array($_GET,$_POST,$_COOKIE,$_REQUEST);
$fixed = array();
foreach ($process as $index=>$glob) {
  $fixed[$index] = unMagicQuotify($glob);
}
list($_GET,$_POST,$_COOKIE,$_REQUEST) = $fixed;
Wednesday, November 16, 2022
4
echo bin2hex($string);

or:

for ($i = 0; $i < strlen($string); $i++) {
    echo str_pad(dechex(ord($string[$i])), 2, '0', STR_PAD_LEFT);
}

$string is the variable which contains input.

Saturday, August 6, 2022
 
mayjak
 
4
$string = preg_replace('~R~u', "rn", $string);

If you don't want to replace all Unicode newlines but only CRLF style ones, use:

$string = preg_replace('~(*BSR_ANYCRLF)R~', "rn", $string);

R matches these newlines, u is a modifier to treat the input string as UTF-8.


From the PCRE docs:

What R matches

By default, the sequence R in a pattern matches any Unicode newline sequence, whatever has been selected as the line ending sequence. If you specify

     --enable-bsr-anycrlf

the default is changed so that R matches only CR, LF, or CRLF. Whatever is selected when PCRE is built can be overridden when the library functions are called.

and

Newline sequences

Outside a character class, by default, the escape sequence R matches any Unicode newline sequence. In non-UTF-8 mode R is equivalent to the following:

    (?>rn|n|x0b|f|r|x85)

This is an example of an "atomic group", details of which are given below. This particular group matches either the two-character sequence CR followed by LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next line, U+0085). The two-character sequence is treated as a single unit that cannot be split.

In UTF-8 mode, two additional characters whose codepoints are greater than 255 are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). Unicode character property support is not needed for these characters to be recognized.

It is possible to restrict R to match only CR, LF, or CRLF (instead of the complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. (BSR is an abbrevation for "backslash R".) This can be made the default when PCRE is built; if this is the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option. It is also possible to specify these settings by starting a pattern string with one of the following sequences:

    (*BSR_ANYCRLF)   CR, LF, or CRLF only
    (*BSR_UNICODE)   any Unicode newline sequence

These override the default and the options given to pcre_compile() or pcre_compile2(), but they can be overridden by options given to pcre_exec() or pcre_dfa_exec(). Note that these special settings, which are not Perl-compatible, are recognized only at the very start of a pattern, and that they must be in upper case. If more than one of them is present, the last one is used. They can be combined with a change of newline convention; for example, a pattern can start with:

    (*ANY)(*BSR_ANYCRLF)

They can also be combined with the (*UTF8) or (*UCP) special sequences. Inside a character class, R is treated as an unrecognized escape sequence, and so matches the letter "R" by default, but causes an error if PCRE_EXTRA is set.

Monday, September 5, 2022
 
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :