/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */
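package org.apache.shindig.gadgets.encoding;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.nio.charset.Charset;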

/**
 * Attempts to determine the character encoding of a given sequence of bytes.
 *
 * Heavily skewed towards common encodings (UTF-8 and Latin-1).
 */
public class EncodingDetector {
  private static final Charset UTF_8 = Charset.forName("UTF-8");
  private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");

  /**
   * Returns the detected encoding of the given byte array.
   *
   * @param input The data to detect the encoding for.
   * @param assume88591IfNotUtf8 True to assume that the encoding is ISO-8859-1 (the standard
   *     encoding for HTTP) if the bytes are not valid UTF-8. Only recommended if you can
   *     reasonably expect that other encodings are going to be specified. Full encoding
   *     detection is very expensive!
   * @return The detected encoding.
   */
  public static Charset detectEncoding(byte[] input, boolean assume88591IfNotUtf8) {
    if (looksLikeValidUtf8(input)) {
      return UTF_8;
    }

    if (assume88591IfNotUtf8) {
      return ISO_8859_1;
    }

    // Fall back to the incredibly slow ICU. It might be better to just skip this entirely.
    CharsetDetector detector = new CharsetDetector();
    detector.setText(input);
    CharsetMatch match = detector.detect();
    return Charset.forName(match.getName().toUpperCase());
  }

  /**
   * A pretty good test that something is UTF-8. There are many sequences that will pass here
   * that aren't valid UTF-8 due to the requirement that the shortest possible sequence always
   * be used. We're ok with this behavior because the main goal is speed.
   */
  private static boolean looksLikeValidUtf8(byte[] input) {
    int i = 0;
    if (input.length >= 3 &&
        (input[0] & 0xFF) == 0xEF &&
        (input[1] & 0xFF) == 0xBB &&
        (input[2] & 0xFF) == 0xBF) {
      // Skip BOM.
      i = 3;
    }

    int endOfSequence;
    for (int j = input.length; i < j; ++i) {
      int bite = input[i];
      if ((bite & 0x80) == 0) {
        continue; // ASCII
      }

      // Determine number of bytes in the sequence.
      if ((bite & 0x0E0) == 0x0C0) {
        endOfSequence = i + 1;
      } else if ((bite & 0x0F0) == 0x0E0) {
        endOfSequence = i + 2;
      } else if ((bite & 0x0F8) == 0xF0) {
        endOfSequence = i + 3;
      } else {
        // Not a valid UTF-8 lead byte.
        return false;
      }

      // A sequence that runs past the end of the input is truncated, so reject it
      // instead of reading out of bounds.
      if (endOfSequence >= j) {
        return false;
      }

      while (i < endOfSequence) {
        i++;
        bite = input[i];
        if ((bite & 0xC0) != 0x80) {
          // Not a continuation byte (10xxxxxx), so not a valid sequence.
          return false;
        }
      }
    }
    return true;
  }
}
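The comment on looksLikeValidUtf8 notes that the fast check accepts some byte sequences that are not strictly valid UTF-8, because it never verifies that the shortest possible encoding was used. The following is a minimal standalone sketch of that trade-off, not the Shindig implementation: the class name Utf8HeuristicSketch and both method names are hypothetical, and the strict check simply delegates to the JDK's own UTF-8 decoder for comparison.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Shindig.
public class Utf8HeuristicSketch {

  /** Structural check only: a lead byte followed by the right number of 10xxxxxx bytes. */
  public static boolean looksLikeUtf8(byte[] input) {
    int i = 0;
    while (i < input.length) {
      int b = input[i] & 0xFF;
      int continuation;
      if ((b & 0x80) == 0) {
        continuation = 0; // 0xxxxxxx: ASCII
      } else if ((b & 0xE0) == 0xC0) {
        continuation = 1; // 110xxxxx: two-byte sequence
      } else if ((b & 0xF0) == 0xE0) {
        continuation = 2; // 1110xxxx: three-byte sequence
      } else if ((b & 0xF8) == 0xF0) {
        continuation = 3; // 11110xxx: four-byte sequence
      } else {
        return false; // not a valid lead byte
      }
      if (i + continuation >= input.length) {
        return false; // sequence is cut off by the end of the input
      }
      for (int k = 1; k <= continuation; k++) {
        if ((input[i + k] & 0xC0) != 0x80) {
          return false; // expected a continuation byte
        }
      }
      i += continuation + 1;
    }
    return true;
  }

  /** Strict check: the JDK decoder also rejects overlong (non-shortest) forms. */
  public static boolean isStrictUtf8(byte[] input) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(input));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
    byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
    byte[] overlong = {(byte) 0xC0, (byte) 0x80}; // overlong encoding of U+0000

    System.out.println("valid:    " + looksLikeUtf8(valid) + " / " + isStrictUtf8(valid));
    System.out.println("latin1:   " + looksLikeUtf8(latin1) + " / " + isStrictUtf8(latin1));
    System.out.println("overlong: " + looksLikeUtf8(overlong) + " / " + isStrictUtf8(overlong));
  }
}
```

The overlong pair 0xC0 0x80 passes the structural check but is rejected by the strict decoder, which is the kind of false positive the original comment accepts in exchange for speed. Latin-1 text with accented characters typically fails both checks, which is why the detector can fall through to ISO-8859-1 cheaply.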