Start line:  
End line:  

Snippet Preview

Snippet HTML Code

Stack Overflow Questions
  /*
   *
   * Licensed to the Apache Software Foundation (ASF) under one
   * or more contributor license agreements.  See the NOTICE file
   * distributed with this work for additional information
   * regarding copyright ownership.  The ASF licenses this file
   * to you under the Apache License, Version 2.0 (the
   * "License"); you may not use this file except in compliance
   * with the License.  You may obtain a copy of the License at
  *
  *     http://www.apache.org/licenses/LICENSE-2.0
  *
  * Unless required by applicable law or agreed to in writing, software
  * distributed under the License is distributed on an "AS IS" BASIS,
  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
 
 package org.apache.hadoop.hbase.util;
 
 
Implements a Bloom filter, as defined by Bloom in 1970.

The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by the networking research community in the past decade thanks to the bandwidth efficiencies that it offers for the transmission of set membership information between networked hosts. A sender encodes the information into a bit vector, the Bloom filter, that is more compact than a conventional representation. Computation and space costs for construction are linear in the number of elements. The receiver uses the filter to test whether various elements are members of the set. Though the filter will occasionally return a false positive, it will never return a false negative. When creating the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.

Originally inspired by European Commission One-Lab Project 034819. Bloom filters are very sensitive to the number of elements inserted into them. For HBase, the number of entries depends on the size of the data stored in the column. Currently the default region size is 256MB, so entry count ~= 256MB / (average value size for column). Despite this rule of thumb, there is no efficient way to calculate the entry count after compactions. Therefore, it is often easier to use a dynamic bloom filter that will add extra space instead of allowing the error rate to grow. ( http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey .pdf ) m denotes the number of bits in the Bloom filter (bitSize) n denotes the number of elements inserted into the Bloom filter (maxKeys) k represents the number of hash functions used (nbHash) e represents the desired false positive rate for the bloom (err) If we fix the error rate (e) and know the number of entries, then the optimal bloom size m = -(n * ln(err) / (ln(2)^2) ~= n * ln(err) / ln(0.6185) The probability of false positives is minimized when k = m/n ln(2).

See also:
BloomFilter The general behavior of a filter
Space/Time Trade-Offs in Hash Coding with Allowable Errors
 
 public class ByteBloomFilter implements BloomFilterBloomFilterWriter {

  
Current file format version
 
   public static final int VERSION = 1;

  
Bytes (B) in the array. This actually has to fit into an int.
 
   protected long byteSize;
  
Number of hash functions
 
   protected int hashCount;
  
Hash type
 
   protected final int hashType;
  
Hash Function
 
   protected final Hash hash;
  
Keys currently in the bloom
 
   protected int keyCount;
  
Max Keys expected for the bloom
 
   protected int maxKeys;
  
Bloom bits
 
  protected ByteBuffer bloom;

  
Record separator for the Bloom filter statistics human-readable string
  public static final String STATS_RECORD_SEP = "; ";

  
Used in computing the optimal Bloom filter size. This approximately equals 0.480453.
  public static final double LOG2_SQUARED = Math.log(2) * Math.log(2);

  
A random number generator to use for "fake lookups" when testing to estimate the ideal false positive rate.
  private static Random randomGeneratorForTest;

  
Bit-value lookup array to prevent doing the same work over and over
  private static final byte [] bitvals = {
    (byte) 0x01,
    (byte) 0x02,
    (byte) 0x04,
    (byte) 0x08,
    (byte) 0x10,
    (byte) 0x20,
    (byte) 0x40,
    (byte) 0x80
  };

  
Loads bloom filter meta data from file input.

Parameters:
meta stored bloom meta data
Throws:
java.lang.IllegalArgumentException meta data is invalid
  public ByteBloomFilter(DataInput meta)
      throws IOExceptionIllegalArgumentException {
    this. = meta.readInt();
    this. = meta.readInt();
    this. = meta.readInt();
    this. = meta.readInt();
    this. = this.;
    this. = Hash.getInstance(this.);
    if ( == null) {
      throw new IllegalArgumentException("Invalid hash type: " + );
    }
    sanityCheck();
  }

  

Parameters:
maxKeys
errorRate
Returns:
the number of bits for a Bloom filter than can hold the given number of keys and provide the given error rate, assuming that the optimal number of hash functions is used and it does not have to be an integer.
  public static long computeBitSize(long maxKeysdouble errorRate) {
    return (long) Math.ceil(maxKeys * (-Math.log(errorRate) / ));
  }

  
The maximum number of keys we can put into a Bloom filter of a certain size to maintain the given error rate, assuming the number of hash functions is chosen optimally and does not even have to be an integer (hence the "ideal" in the function name).

Parameters:
bitSize
errorRate
Returns:
maximum number of keys that can be inserted into the Bloom filter
See also:
computeMaxKeys(long,double,int) for a more precise estimate
  public static long idealMaxKeys(long bitSizedouble errorRate) {
    // The reason we need to use floor here is that otherwise we might put
    // more keys in a Bloom filter than is allowed by the target error rate.
    return (long) (bitSize * ( / -Math.log(errorRate)));
  }

  
The maximum number of keys we can put into a Bloom filter of a certain size to get the given error rate, with the given number of hash functions.

Parameters:
bitSize
errorRate
hashCount
Returns:
the maximum number of keys that can be inserted in a Bloom filter to maintain the target error rate, if the number of hash functions is provided.
  public static long computeMaxKeys(long bitSizedouble errorRate,
      int hashCount) {
    return (long) (-bitSize * 1.0 / hashCount *
        Math.log(1 - Math.exp(Math.log(errorRate) / hashCount)));
  }

  
Computes the error rate for this Bloom filter, taking into account the actual number of hash functions and keys inserted. The return value of this function changes as a Bloom filter is being populated. Used for reporting the actual error rate of compound Bloom filters when writing them out.

Returns:
error rate for this particular Bloom filter
  public double actualErrorRate() {
    return actualErrorRate( * 8, );
  }

  
Computes the actual error rate for the given number of elements, number of bits, and number of hash functions. Taken directly from the Wikipedia Bloom filter article.

Parameters:
maxKeys
bitSize
functionCount
Returns:
the actual error rate
  public static double actualErrorRate(long maxKeyslong bitSize,
      int functionCount) {
    return Math.exp(Math.log(1 - Math.exp(-functionCount * maxKeys * 1.0
        / bitSize)) * functionCount);
  }

  
Increases the given byte size of a Bloom filter until it can be folded by the given factor.

Parameters:
bitSize
foldFactor
Returns:
Foldable byte size
  public static int computeFoldableByteSize(long bitSizeint foldFactor) {
    long byteSizeLong = (bitSize + 7) / 8;
    int mask = (1 << foldFactor) - 1;
    if ((mask & byteSizeLong) != 0) {
      byteSizeLong >>= foldFactor;
      ++byteSizeLong;
      byteSizeLong <<= foldFactor;
    }
    if (byteSizeLong > .) {
      throw new IllegalArgumentException("byteSize=" + byteSizeLong + " too "
          + "large for bitSize=" + bitSize + ", foldFactor=" + foldFactor);
    }
    return (intbyteSizeLong;
  }
  private static int optimalFunctionCount(int maxKeyslong bitSize) {
    long i = bitSize / maxKeys;
    double result = Math.ceil(Math.log(2) * i);
    if (result > .){
      throw new IllegalArgumentException("result too large for integer value.");
    }
    return (int)result;
  }

  
Private constructor used by other constructors.
  private ByteBloomFilter(int hashType) {
    this. = hashType;
    this. = Hash.getInstance(hashType);
  }

  
Determines & initializes bloom filter meta data from user config. Call allocBloom() to allocate bloom filter data.

Parameters:
maxKeys Maximum expected number of keys that will be stored in this bloom
errorRate Desired false positive error rate. Lower rate = more storage required
hashType Type of hash function to use
foldFactor When finished adding entries, you may be able to 'fold' this bloom to save space. Tradeoff potentially excess bytes in bloom for ability to fold if keyCount is exponentially greater than maxKeys.
Throws:
java.lang.IllegalArgumentException
  public ByteBloomFilter(int maxKeysdouble errorRateint hashType,
      int foldFactorthrows IllegalArgumentException {
    this(hashType);
    long bitSize = computeBitSize(maxKeyserrorRate);
     = optimalFunctionCount(maxKeysbitSize);
    this. = maxKeys;
    // increase byteSize so folding is possible
     = computeFoldableByteSize(bitSizefoldFactor);
    sanityCheck();
  }

  
Creates a Bloom filter of the given size.

Parameters:
byteSizeHint the desired number of bytes for the Bloom filter bit array. Will be increased so that folding is possible.
errorRate target false positive rate of the Bloom filter
hashType Bloom filter hash function type
foldFactor
Returns:
the new Bloom filter of the desired size
  public static ByteBloomFilter createBySize(int byteSizeHint,
      double errorRateint hashTypeint foldFactor) {
    ByteBloomFilter bbf = new ByteBloomFilter(hashType);
    bbf.byteSize = computeFoldableByteSize(byteSizeHint * 8L, foldFactor);
    long bitSize = bbf.byteSize * 8;
    bbf.maxKeys = (intidealMaxKeys(bitSizeerrorRate);
    bbf.hashCount = optimalFunctionCount(bbf.maxKeysbitSize);
    // Adjust max keys to bring error rate closer to what was requested,
    // because byteSize was adjusted to allow for folding, and hashCount was
    // rounded.
    bbf.maxKeys = (intcomputeMaxKeys(bitSizeerrorRatebbf.hashCount);
    return bbf;
  }

  
Creates another similar Bloom filter. Does not copy the actual bits, and sets the new filter's key count to zero.

Returns:
a Bloom filter with the same configuration as this
    bbf.byteSize = ;
    bbf.hashCount = ;
    bbf.maxKeys = ;
    return bbf;
  }
  public void allocBloom() {
    if (this. != null) {
      throw new IllegalArgumentException("can only create bloom once.");
    }
    this. = ByteBuffer.allocate((int)this.);
    assert this..hasArray();
  }
    if(0 >= this. || this. > .) {
      throw new IllegalArgumentException("Invalid byteSize: " + this.);
    }
    if(this. <= 0) {
      throw new IllegalArgumentException("Hash function count must be > 0");
    }
    if (this. == null) {
      throw new IllegalArgumentException("hashType must be known");
    }
    if (this. < 0) {
      throw new IllegalArgumentException("must have positive keyCount");
    }
  }
  void bloomCheck(ByteBuffer bloom)  throws IllegalArgumentException {
    if (this. != bloom.limit()) {
      throw new IllegalArgumentException(
          "Configured bloom length should match actual length");
    }
  }
  public void add(byte [] buf) {
    add(buf, 0, buf.length);
  }
  public void add(byte [] bufint offsetint len) {
    /*
     * For faster hashing, use combinatorial generation
     * http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf
     */
    int hash1 = this..hash(bufoffsetlen, 0);
    int hash2 = this..hash(bufoffsetlenhash1);
    for (int i = 0; i < this.i++) {
      long hashLoc = Math.abs((hash1 + i * hash2) % (this. * 8));
      set(hashLoc);
    }
    ++this.;
  }

  
Should only be used in tests
  boolean contains(byte [] buf) {
    return contains(buf, 0, buf.lengththis.);
  }

  
Should only be used in tests
  boolean contains(byte [] bufint offsetint length) {
    return contains(bufoffsetlength);
  }

  
Should only be used in tests
  boolean contains(byte[] bufByteBuffer bloom) {
    return contains(buf, 0, buf.lengthbloom);
  }
  public boolean contains(byte[] bufint offsetint length,
      ByteBuffer theBloom) {
    if (theBloom == null) {
      // In a version 1 HFile Bloom filter data is stored in a separate meta
      // block which is loaded on demand, but in version 2 it is pre-loaded.
      // We want to use the same API in both cases.
      theBloom = ;
    }
    if (theBloom.limit() != ) {
      throw new IllegalArgumentException("Bloom does not match expected size:"
          + " theBloom.limit()=" + theBloom.limit() + ", byteSize=" + );
    }
    return contains(bufoffsetlengththeBloom.array(),
        theBloom.arrayOffset(), (int);
  }
  public static boolean contains(byte[] bufint offsetint length,
      byte[] bloomArrayint bloomOffsetint bloomSizeHash hash,
      int hashCount) {
    int hash1 = hash.hash(bufoffsetlength, 0);
    int hash2 = hash.hash(bufoffsetlengthhash1);
    int bloomBitSize = bloomSize << 3;
    
    if ( == null) {
      // Production mode.
      int compositeHash = hash1;
      for (int i = 0; i < hashCounti++) {
        int hashLoc = Math.abs(compositeHash % bloomBitSize);
        compositeHash += hash2;
        if (!get(hashLocbloomArraybloomOffset)) {
          return false;
        }
      }
    } else {
      // Test mode with "fake lookups" to estimate "ideal false positive rate".
      for (int i = 0; i < hashCounti++) {
        int hashLoc = .nextInt(bloomBitSize);
        if (!get(hashLocbloomArraybloomOffset)){
          return false;
        }
      }
    }
    return true;
  }
  //---------------------------------------------------------------------------
  
Private helpers


  
Set the bit at the specified index to 1.

Parameters:
pos index of bit
  void set(long pos) {
    int bytePos = (int)(pos / 8);
    int bitPos = (int)(pos % 8);
    byte curByte = .get(bytePos);
    curByte |= [bitPos];
    .put(bytePoscurByte);
  }

  
Check if bit at specified index is 1.

Parameters:
pos index of bit
Returns:
true if bit at specified index is 1, false if 0.
  static boolean get(int posbyte[] bloomArrayint bloomOffset) {
    int bytePos = pos >> 3; //pos / 8
    int bitPos = pos & 0x7; //pos % 8
    byte curByte = bloomArray[bloomOffset + bytePos];
    curByte &= [bitPos];
    return (curByte != 0);
  }
  public long getKeyCount() {
    return ;
  }
  public long getMaxKeys() {
    return ;
  }
  public long getByteSize() {
    return ;
  }
  public int getHashType() {
    return ;
  }
  public void compactBloom() {
    // see if the actual size is exponentially smaller than expected.
    if (this. > 0 && this..hasArray()) {
      int pieces = 1;
      int newByteSize = (int)this.;
      int newMaxKeys = this.;
      // while exponentially smaller & folding is lossless
      while ( (newByteSize & 1) == 0 && newMaxKeys > (this.<<1) ) {
        pieces <<= 1;
        newByteSize >>= 1;
        newMaxKeys >>= 1;
      }
      // if we should fold these into pieces
      if (pieces > 1) {
        byte[] array = this..array();
        int start = this..arrayOffset();
        int end = start + newByteSize;
        int off = end;
        for(int p = 1; p < pieces; ++p) {
          for(int pos = startpos < end; ++pos) {
            array[pos] |= array[off++];
          }
        }
        // folding done, only use a subset of this array
        this..rewind();
        this..limit(newByteSize);
        this. = this..slice();
        this. = newByteSize;
        this. = newMaxKeys;
      }
    }
  }
  //---------------------------------------------------------------------------

  
Writes just the bloom filter to the output array

Parameters:
out OutputStream to place bloom
Throws:
java.io.IOException Error writing bloom array
  public void writeBloom(final DataOutput outthrows IOException {
    if (!this..hasArray()) {
      throw new IOException("Only writes ByteBuffer with underlying array.");
    }
    out.write(.array(), .arrayOffset(), .limit());
  }
  public Writable getMetaWriter() {
    return new MetaWriter();
  }
  public Writable getDataWriter() {
    return new DataWriter();
  }
  private class MetaWriter implements Writable {
    protected MetaWriter() {}
    @Override
    public void readFields(DataInput arg0throws IOException {
      throw new IOException("Cant read with this class.");
    }
    @Override
    public void write(DataOutput outthrows IOException {
      out.writeInt();
      out.writeInt((int);
      out.writeInt();
      out.writeInt();
      out.writeInt();
    }
  }
  private class DataWriter implements Writable {
    protected DataWriter() {}
    @Override
    public void readFields(DataInput arg0throws IOException {
      throw new IOException("Cant read with this class.");
    }
    @Override
    public void write(DataOutput outthrows IOException {
      writeBloom(out);
    }
  }
  public int getHashCount() {
    return ;
  }
  public boolean supportsAutoLoading() {
    return  != null;
  }
  public static void setFakeLookupMode(boolean enabled) {
    if (enabled) {
       = new Random(283742987L);
    } else {
       = null;
    }
  }

  
Just concatenate row and column by default. May return the original row buffer if the column qualifier is empty.
  public byte[] createBloomKey(byte[] rowBufint rowOffsetint rowLen,
      byte[] qualBufint qualOffsetint qualLen) {
    // Optimize the frequent case when only the row is provided.
    if (qualLen <= 0 && rowOffset == 0 && rowLen == rowBuf.length)
      return rowBuf;
    byte [] result = new byte[rowLen + qualLen];
    System.arraycopy(rowBufrowOffsetresult, 0,  rowLen);
    if (qualLen > 0)
      System.arraycopy(qualBufqualOffsetresultrowLenqualLen);
    return result;
  }
  public KVComparator getComparator() {
//    return Bytes.BYTES_RAWCOMPARATOR;
    return .;
  }

  
A human-readable string with statistics for the given Bloom filter.

Parameters:
bloomFilter the Bloom filter to output statistics for;
Returns:
a string consisting of "<key>: <value>" parts separated by STATS_RECORD_SEP.
  public static String formatStats(BloomFilterBase bloomFilter) {
    StringBuilder sb = new StringBuilder();
    long k = bloomFilter.getKeyCount();
    long m = bloomFilter.getMaxKeys();
    sb.append("BloomSize: " + bloomFilter.getByteSize() + );
    sb.append("No of Keys in bloom: " + k + );
    sb.append("Max Keys for bloom: " + m);
    if (m > 0) {
      sb.append( + "Percentage filled: "
          + NumberFormat.getPercentInstance().format(k * 1.0 / m));
    }
    return sb.toString();
  }
  public String toString() {
    return formatStats(this) +  + "Actual error rate: "
        + String.format("%.8f"actualErrorRate());
  }
New to GrepCode? Check out our FAQ X