
In the "PHP Cookbook", they say (p.589) that to properly set the char encoding of outgoing data to utf-8 it is necessary to edit the default_encoding configuration to utf-8.

However, I cannot find this configuration in php.ini. Should I simply add a line that would say default_encoding = "utf-8"?

I do have a ;default_charset = "iso-8859-1" line. As the leading semicolon shows, it is currently commented out. Should I remove the semicolon and set it to "utf-8"? Does that take care of the default encoding?

I also found other encoding directives that I don't know what to do about:

[iconv]
;iconv.input_encoding = ISO-8859-1
;iconv.internal_encoding = ISO-8859-1
;iconv.output_encoding = ISO-8859-1
...
; http://php.net/exif.encode-unicode
;exif.encode_unicode = ISO-8859-15
...
;mssql.charset = "ISO-8859-1"
...
;exif.encode_unicode = ISO-8859-15

Is there any reason why I shouldn't simply replace them all with utf-8?

 Answers


You should set your default_charset to UTF-8:

default_charset = "utf-8"

(The PHP Cookbook may have a typo if it tells you to change default_encoding; I've never heard of that directive.)
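If you can't edit php.ini (on shared hosting, say), a runtime alternative is to set this per request; a minimal sketch using standard PHP calls:

// Override the ini setting for the current request...
ini_set('default_charset', 'UTF-8');
// ...or declare the charset directly on the response header.
// (Both must run before any output is sent.)
header('Content-Type: text/html; charset=utf-8');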

You'll also want to make sure that your webserver is set to output UTF-8 if you're going to output UTF-8-encoded characters. In Apache this can be set in the httpd.conf file:

AddDefaultCharset UTF-8

As for the iconv, exif, and mssql encoding settings, you probably don't need to set these (your configuration has them commented out anyway), but it's a good idea to switch them all to UTF-8 for consistency.
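If you do switch them, the uncommented php.ini lines would look something like this (a sketch limited to the directives quoted in the question):

default_charset = "UTF-8"
iconv.input_encoding = UTF-8
iconv.internal_encoding = UTF-8
iconv.output_encoding = UTF-8
exif.encode_unicode = UTF-8
mssql.charset = "UTF-8"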

Monday, September 26, 2022

Note: The following applies to Windows PowerShell.
See the next section for the cross-platform PowerShell Core (v6+) edition.

  • On PSv5.1 or higher, where > and >> are effectively aliases of Out-File, you can set the default encoding for > / >> / Out-File via the $PSDefaultParameterValues preference variable:

    • $PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'
  • On PSv5.0 or below, you cannot change the encoding for > / >>, but, on PSv3 or higher, the above technique does work for explicit calls to Out-File.
    (The $PSDefaultParameterValues preference variable was introduced in PSv3.0).

  • On PSv3.0 or higher, if you want to set the default encoding for all cmdlets that support an -Encoding parameter (which in PSv5.1+ includes > and >>), use:

    • $PSDefaultParameterValues['*:Encoding'] = 'utf8'

If you place this command in your $PROFILE, cmdlets such as Out-File and Set-Content will use UTF-8 encoding by default. Note, however, that this makes it a session-global setting that affects all commands and scripts that do not explicitly specify an encoding via their -Encoding parameter.

Similarly, include such commands in any scripts or modules of yours that should behave the same way, so that they behave identically even when run by another user or on a different machine; however, to avoid a session-global change, use the following form to create a local copy of $PSDefaultParameterValues:

  • $PSDefaultParameterValues = @{ '*:Encoding' = 'utf8' }
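A quick sanity check (a sketch, assuming Windows PowerShell 5.1; test.txt and the sample string are placeholders):

$PSDefaultParameterValues['*:Encoding'] = 'utf8'
'héllo' > test.txt                                      # UTF-16LE without the override
[System.IO.File]::ReadAllBytes("$PWD\test.txt")[0..2]   # 239 187 191 = the UTF-8 pseudo-BOM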

Caveat: PowerShell, as of v5.1, invariably creates UTF-8 files with a (pseudo) BOM, which is customary only in the Windows world - Unix-based utilities do not recognize this BOM (see bottom); see this post for workarounds that create BOM-less UTF-8 files.
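One such workaround, sketched here on the assumption that calling into .NET directly is acceptable (the parameterless .NET UTF-8 encoder emits no BOM):

# Write a BOM-less UTF-8 file from Windows PowerShell via .NET
$lines = 'line 1', 'line 2'
[System.IO.File]::WriteAllLines("$PWD\out.txt", $lines)   # BOM-less UTF-8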

For a summary of the wildly inconsistent default character encoding behavior across many of the Windows PowerShell standard cmdlets, see the bottom section.


The automatic $OutputEncoding variable is unrelated, and only applies to how PowerShell communicates with external programs (what encoding PowerShell uses when sending strings to them) - it has nothing to do with the encoding that the output redirection operators and PowerShell cmdlets use to save to files.


Optional reading: the cross-platform perspective (PowerShell Core):

PowerShell is now cross-platform, via its PowerShell Core edition, whose encoding - sensibly - defaults to BOM-less UTF-8, in line with Unix-like platforms.

  • This means that source-code files without a BOM are assumed to be UTF-8, and that using > / Out-File / Set-Content defaults to BOM-less UTF-8; even explicit use of -Encoding utf8 creates BOM-less UTF-8, but you can opt to create files with the pseudo-BOM via the utf8BOM value.

  • If you create PowerShell scripts in an editor on a Unix-like platform, or nowadays even on Windows with a cross-platform editor such as Visual Studio Code or Sublime Text, the resulting *.ps1 file will typically not have a UTF-8 pseudo-BOM:

    • This works fine on PowerShell Core.
    • It may break on Windows PowerShell, if the file contains non-ASCII characters; if you do need to use non-ASCII characters in your scripts, save them as UTF-8 with BOM.
      Without the BOM, Windows PowerShell (mis)interprets your script as being encoded in the legacy "ANSI" codepage (determined by the system locale for pre-Unicode applications; e.g., Windows-1252 on US-English systems).
  • Conversely, files that do have the UTF-8 pseudo-BOM can be problematic on Unix-like platforms, as they cause Unix utilities such as cat, sed, and awk - and even some editors such as gedit - to pass the pseudo-BOM through, i.e., to treat it as data.

    • This may not always be a problem, but definitely can be, such as when you try to read a file into a string in bash with, say, text=$(cat file) or text=$(<file) - the resulting variable will contain the pseudo-BOM as the first 3 bytes.

Inconsistent default encoding behavior in Windows PowerShell:

Regrettably, the default character encoding used in Windows PowerShell is wildly inconsistent; the cross-platform PowerShell Core edition, as discussed in the previous section, has commendably put an end to this.

Note:

  • The following doesn't aspire to cover all standard cmdlets.

  • Googling cmdlet names to find their help topics now shows you the PowerShell Core version of the topics by default; use the version drop-down list above the list of topics on the left to switch to a Windows PowerShell version.

  • As of this writing, the documentation frequently incorrectly claims that ASCII is the default encoding in Windows PowerShell - see this GitHub docs issue.


Cmdlets that write:

Out-File and > / >> create "Unicode" (UTF-16LE) files by default, in which even ASCII-range characters are represented by 2 bytes; this notably differs from Set-Content / Add-Content (see next point). New-ModuleManifest and Export-CliXml also create UTF-16LE files.

Set-Content (and Add-Content if the file doesn't yet exist / is empty) uses ANSI encoding (the encoding specified by the active system locale's ANSI legacy code page, which PowerShell calls Default).

Export-Csv indeed creates ASCII files, as documented, but see the notes re -Append below.

Export-PSSession creates UTF-8 files with BOM by default.

New-Item -Type File -Value currently creates BOM-less(!) UTF-8.

The Send-MailMessage help topic also claims that ASCII encoding is the default - I have not personally verified that claim.

Start-Transcript invariably creates UTF-8 files with BOM, but see the notes re -Append below.

Re commands that append to an existing file:

>> / Out-File -Append make no attempt to match the encoding of a file's existing content. That is, they blindly apply their default encoding, unless instructed otherwise with -Encoding, which is not an option with >> (except indirectly in PSv5.1+, via $PSDefaultParameterValues, as shown above). In short: you must know the encoding of an existing file's content and append using that same encoding.
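For instance, to append to a file known to be ANSI-encoded (a minimal sketch; existing.txt is a placeholder, and Default is Windows PowerShell's name for the ANSI code page):

'more text' | Out-File -Append -Encoding Default existing.txt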

Add-Content is the laudable exception: in the absence of an explicit -Encoding argument, it detects the existing encoding and automatically applies it to the new content (thanks, js2010). Note that in Windows PowerShell this means that ANSI encoding is applied if the existing content has no BOM, whereas it is UTF-8 in PowerShell Core.

This inconsistency between Out-File -Append / >> and Add-Content, which also affects PowerShell Core, is discussed in this GitHub issue.

Export-Csv -Append partially matches the existing encoding: it blindly appends UTF-8 if the existing file's encoding is any of ASCII/UTF-8/ANSI, but correctly matches UTF-16LE and UTF-16BE.
To put it differently: in the absence of a BOM, Export-Csv -Append assumes UTF-8, whereas Add-Content assumes ANSI.

Start-Transcript -Append partially matches the existing encoding: It correctly matches encodings with BOM, but defaults to potentially lossy ASCII encoding in the absence of one.


Cmdlets that read (that is, the encoding used in the absence of a BOM):

Get-Content and Import-PowerShellDataFile default to ANSI (Default), which is consistent with Set-Content.
ANSI is also what the PowerShell engine itself defaults to when it reads source code from files.

By contrast, Import-Csv, Import-CliXml and Select-String assume UTF-8 in the absence of a BOM.
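To avoid relying on these defaults when reading, specify the encoding explicitly; a minimal sketch (file.txt is a placeholder):

Get-Content -Encoding UTF8 file.txt   # read a BOM-less file as UTF-8 in Windows PowerShell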

Friday, September 16, 2022

deceze is right: that is ISO-8859-1 (a.k.a. Latin-1) encoded text. Try this:

file_contents = CSV.read("csvfile.csv", col_sep: "$", encoding: "ISO8859-1")

And if that doesn't work, you can use Iconv to fix up the individual strings with something like this:

require 'iconv'
utf8_string = Iconv.iconv('utf-8', 'iso8859-1', latin1_string).first

If latin1_string is "Non sp\xE9cifi\xE9", then utf8_string will be "Non spécifié". Also, Iconv.iconv can unmangle whole arrays at a time:

utf8_strings = Iconv.iconv('utf-8', 'iso8859-1', *latin1_strings)

With newer Rubies, you can do things like this:

utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8')

where latin1_string thinks it is in ASCII-8BIT but is really in ISO-8859-1.
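When the source data may contain bytes that are invalid in, or unmappable from, the source encoding, String#encode accepts replacement options; a hedged sketch using the standard API:

# Replace invalid or unmappable bytes instead of raising an Encoding error
utf8_string = latin1_string.force_encoding('iso-8859-1').encode('utf-8', invalid: :replace, undef: :replace, replace: '?')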

Wednesday, August 31, 2022
 

(ruby 1.9.2)

#encoding: UTF-8
class String
  def contains_cjk?
    # \p{...} matches characters by Unicode script
    !!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
  end
end

# Sample inputs: Han, Hiragana, Hangul, and plain Latin text
strings = ['日本語', 'ひらがな', '한국어', 'Watashi ha bakana gaijin desu.']
strings.each { |s| puts s.contains_cjk? }

#true
#true
#true
#false

\p{} matches a character’s Unicode script.
The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.

Wow. (Source: the Ruby Regexp documentation.)

Friday, December 16, 2022
 

Everything looks fine to me.

>>> hex(ord(u'°'))
'0xb0'
>>> hex(ord(u'ç'))
'0xe7'

Perhaps you should decode the JSON before attempting to use it.
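For example, a minimal sketch (the sample string is hypothetical; json.loads turns \u00b0-style escapes back into real characters):

>>> import json
>>> json.loads('"25\u00b0C"')
u'25\xb0C'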

Sunday, November 27, 2022
 