
I have a C# routine that imports data from a CSV file, matches it against a database and then rewrites it to a file. The source file seems to have a few non-ASCII characters that are fouling up the processing routine.

I already have a static method that I run each input field through but it performs basic checks like removing commas and quotes. Does anybody know how I could add functionality that removes non-ASCII characters too?

Answers

string sOut = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s)); // note: non-ASCII characters come back as '?', they are not removed
Friday, November 18, 2022

To simply remove the non-ASCII characters, you could use base R's iconv(), setting sub = "". Something like this should work:

x <- c("Ekstrxf8m", "Jxf6reskog", "bixdfchen Zxfcrcher") # e.g. from ?iconv
Encoding(x) <- "latin1"  # (just to make sure)
x
# [1] "Ekstrøm"         "Jöreskog"        "bißchen Zürcher"

iconv(x, "latin1", "ASCII", sub="")
# [1] "Ekstrm"        "Jreskog"       "bichen Zrcher"

To locate non-ASCII characters, or to find if there were any at all in your files, you could likely adapt the following ideas:

## Do *any* lines contain non-ASCII characters? 
any(grepl("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII")))
[1] TRUE

## Find which lines (e.g. read in by readLines()) contain non-ASCII characters
grep("I_WAS_NOT_ASCII", iconv(x, "latin1", "ASCII", sub="I_WAS_NOT_ASCII"))
[1] 1 2 3
Tuesday, September 20, 2022
 

Your requirements are not clear. All characters in a Java String are Unicode characters, so if you removed them all, you'd be left with an empty string. I assume what you mean is that you want to remove anything that is not a printable ASCII character.

String clean = str.replaceAll("\\P{Print}", "");

Here, \p{Print} is a POSIX character class for printable ASCII characters, while \P{Print} is the complement of that class. With this expression, every character that is not printable ASCII is replaced with the empty string. (The extra backslash is needed because a backslash starts an escape sequence in Java string literals.)


Apparently, all the input characters are actually ASCII characters that represent a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.

This all sounds a little fishy to me. What I believe is happening is that the data really do contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing these with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.

That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.

String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");

The API documentation for the Pattern class does a good job of listing all of the syntax supported by Java's regex library. For more elaboration on what all of the syntax means, I have found the Regular-Expressions.info site very helpful.

Thursday, September 29, 2022
 
rzcoder
 

There are no guarantees about the particular values, but you shouldn't care, because your software will probably never encounter a system that is not ASCII-compatible in this way. Assume that space is always 32 and that 'A' is always 65; this works fine in the modern world.
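If you do decide to bake in that assumption, one way to make it explicit (a sketch of mine, not part of the answer above; requires C11) is a compile-time check that breaks the build on any non-ASCII system:

/* Compile-time guard: fail the build on any implementation whose
 * execution character set is not ASCII-compatible (C11 _Static_assert). */
_Static_assert(' ' == 32 && 'A' == 65 && 'a' == 97 && '0' == 48,
               "this code assumes an ASCII-compatible character set");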

The C standard guarantees only that the letters A-Z and a-z exist and that each fits within a single byte; it says nothing about their particular values or whether they are contiguous.

It does, however, guarantee that the digits 0-9 are contiguous:

In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
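That one guarantee is already enough for number parsing. As a minimal sketch (mine, for illustration): subtracting '0' from a digit character is portable to any conforming implementation, ASCII or not, precisely because of the clause quoted above. No such contiguity is guaranteed for letters, which is why EBCDIC needs separate mention below.

#include <stdio.h>

/* Portable decimal parsing: the standard guarantees '0'..'9' are
 * contiguous, so c - '0' yields the digit's numeric value everywhere. */
static int parse_decimal(const char *s)
{
    int value = 0;
    while (*s >= '0' && *s <= '9')
        value = value * 10 + (*s++ - '0');
    return value;
}

int main(void)
{
    printf("%d\n", parse_decimal("1234"));  /* prints 1234 */
    return 0;
}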

Justification

There are a lot of character encodings out in the world. If you care about portability, you can either make your program portable to different character sets, or you can choose one character set to use everywhere (e.g. Unicode). I'll go ahead and loosely categorize most existing character encodings for you:

  1. Single byte character encodings compatible with ISO/IEC 646. Digits 0-9 and letters A-Z and a-z always occupy the same positions.

  2. Multibyte character encodings (Big5, Shift JIS, ISO 2022-based). In these encodings, your program is probably already broken and you'll need to spend time fixing it if you care. However, parsing numbers will still work as expected.

  3. Unicode encodings. Digits 0-9 and letters A-Z, a-z always occupy the same positions. You can work with code points or code units interchangeably and get the same result, as long as you stay below 128 (which you are); see the sketch after this list. (Are you working with UTF-7? No, you should only use that for email.)

  4. EBCDIC. Digits and letters are assigned different values than in ASCII; however, 0-9 and A-F, a-f are still contiguous. Even then, the chance that your code will run on an EBCDIC system is essentially zero.
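To illustrate point 3 with a sketch of my own: in UTF-8, every byte below 0x80 is exactly the ASCII character it looks like, and every byte of a multibyte sequence is 0x80 or above, so a naive byte-wise ASCII check cannot misfire on multibyte text:

#include <stdbool.h>
#include <stdio.h>

/* True if every byte is below 0x80, i.e. the buffer is pure ASCII.
 * Correct on UTF-8 input because multibyte sequences never contain
 * bytes in the ASCII range. */
static bool is_ascii(const char *s)
{
    for (; *s; s++)
        if ((unsigned char)*s >= 0x80)
            return false;
    return true;
}

int main(void)
{
    printf("%d\n", is_ascii("plain text"));     /* prints 1 */
    printf("%d\n", is_ascii("Z\xc3\xbcrich"));  /* "Zürich": prints 0 */
    return 0;
}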

So the question here is: Do you think that a hypothetical fifth option will be invented in the future, somehow less compatible / more difficult to use than Unicode?

Do you care about EBCDIC?

We could dream up bizarre systems all day... suppose CHAR_BIT is 11, or sizeof(long) = 100, or suppose we use one's complement arithmetic, or malloc() always returns NULL, or suppose the pixels on your monitor are arranged in a hexagonal grid. Suppose your floating-point numbers aren't IEEE 754, suppose all of your data pointers are different sizes. At the end of the day, this does not get us closer to our goals of writing working software on actual modern systems (with the occasional exception).

Thursday, August 11, 2022
 
ngambit
 

I can't make sense of your examples, but if you want to convert a string of hexadecimal ASCII characters to its byte values (e.g. so the string "56" becomes the byte 0x56), you can use this (which assumes your system uses ASCII):

#include <stddef.h>
#include <stdint.h>

uint8_t *
hex_decode(const char *in, size_t len, uint8_t *out)
{
    size_t i, t;
    unsigned int hn, ln;

    for (t = 0, i = 0; i < len; i += 2, ++t) {
        /* high and low nibble; assumes uppercase A-F and an ASCII system */
        hn = in[i] > '9' ? in[i] - 'A' + 10 : in[i] - '0';
        ln = in[i + 1] > '9' ? in[i + 1] - 'A' + 10 : in[i + 1] - '0';

        /* pack the two nibbles into one output byte */
        out[t] = (hn << 4) | ln;
    }

    return out;
}

You'd use it like this:

char x[]="1234";
uint8_t res[2];
hex_decode(x,strlen(x),res);

And res (which must be at least half the length of the in parameter) now contains the two bytes 0x12, 0x34.

Note also that this code needs the hexadecimal letters A-F to be capital; a-f won't do. (It also doesn't do any error checking, so you'll have to pass it valid input.)
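If lowercase input might show up, a small variant (my sketch, not the original code) can normalize each character with toupper() before converting:

#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Case-insensitive variant: accepts "ab12" as well as "AB12".
 * Still assumes valid hex input of even length. */
uint8_t *
hex_decode_ci(const char *in, size_t len, uint8_t *out)
{
    size_t i, t;

    for (t = 0, i = 0; i < len; i += 2, ++t) {
        int h = toupper((unsigned char)in[i]);
        int l = toupper((unsigned char)in[i + 1]);
        unsigned int hn = h > '9' ? h - 'A' + 10 : h - '0';
        unsigned int ln = l > '9' ? l - 'A' + 10 : l - '0';

        out[t] = (uint8_t)((hn << 4) | ln);
    }

    return out;
}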

Friday, September 9, 2022
 