Guava's BloomFilter

2021-04-25 09:33:16

Tags: funnel, int, bits, bytes, long, hash, guava, BloomFilter

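The Bloom Filter in Guava

This walkthrough uses the Guava 27.0.1 source code. The Bloom filter (BF) logic lives in the com.google.common.hash.BloomFilter class. Let's start reading the code.

Member fields of the BloomFilter class

There are not many, only four.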

  /** The bit set of the BloomFilter (not necessarily power of 2!) */
  private final LockFreeBitArray bits;

  /** Number of hashes per element */
  private final int numHashFunctions;

  /** The funnel to translate Ts to bytes */
  private final Funnel<? super T> funnel;

  /** The strategy we employ to map an element T to {@code numHashFunctions} bit indexes. */
  private final Strategy strategy;

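bits is the bit array of length m described in the previous section, wrapped in the LockFreeBitArray type.
numHashFunctions is the number of hash functions, k.
funnel is an instance of a Funnel implementation. It decomposes an input object of arbitrary type T into Java primitive values (byte, int, char, and so on); here everything ends up as bytes that are fed to the hash function.
strategy is the Bloom filter's hashing strategy, i.e. how an element is mapped onto the bit array. The concrete implementations live in the BloomFilterStrategies enum.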

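Constructing a BloomFilter

The constructor of this class is private. To create an instance, go through the create() factory methods. There are five overloads in total, and all of them end up in the following logic (the overload shown here is itself package-private and annotated @VisibleForTesting; the public ones delegate to it).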

  @VisibleForTesting
  static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
    checkNotNull(funnel);
    checkArgument(
        expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);
    checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
    checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
    checkNotNull(strategy);

    if (expectedInsertions == 0) {
      expectedInsertions = 1;
    }
    /*
     * TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size
     * is proportional to -log(p), but there is not much of a point after all, e.g.
     * optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!
     */
    long numBits = optimalNumOfBits(expectedInsertions, fpp);
    int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
    try {
      return new BloomFilter<T>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
    }
  }

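This method takes four parameters: funnel is the Funnel for the inserted data, expectedInsertions is the expected total number of inserted elements n, fpp is the desired false positive probability p, and strategy is the hashing strategy.

As shown above, the bit array length m and the number of hash functions k are estimated by optimalNumOfBits() and optimalNumOfHashFunctions() respectively.

Estimating the optimal m and k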

  @VisibleForTesting
  static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
      p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  @VisibleForTesting
  static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

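To understand these two methods, we have to continue the derivation from the previous section.

From the approximation of the false positive rate, p ≈ (1 − e^{−kn/m})^k, the value of k that minimizes p for given m and n is:

$$k = \frac{m}{n}\ln 2$$

This is the logic of optimalNumOfHashFunctions(). How, then, should m be estimated?

Substituting this k into the formula from the previous section and simplifying gives the relation between the expected false positive rate p and m, n:

$$p = \left(\frac{1}{2}\right)^{k} = 2^{-\frac{m}{n}\ln 2}$$

That is:

$$m = -\frac{n \ln p}{(\ln 2)^2}$$

This is the logic of optimalNumOfBits().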
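
To make the two formulas concrete, here is a minimal JDK-only sketch that re-implements them outside Guava and evaluates them for one set of inputs. The class name BloomSizing and the values n = 1,000,000 and p = 0.01 are assumptions chosen for illustration; the method bodies mirror the Guava snippets above (without the p == 0 guard).

```java
// Standalone sketch (JDK only) of the two sizing formulas discussed above.
// BloomSizing is a hypothetical class name, not part of Guava.
final class BloomSizing {

    // m = -n * ln(p) / (ln 2)^2, truncated to a whole number of bits
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // k = round((m / n) * ln 2), but never fewer than one hash function
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000L; // hypothetical expected insertions
        double p = 0.01;     // hypothetical target false positive probability
        long m = optimalNumOfBits(n, p);
        int k = optimalNumOfHashFunctions(n, m);
        System.out.println("m = " + m + " bits, k = " + k);
    }
}
```

For these inputs the sketch yields roughly 9.59 million bits (a little over 1 MB) and 7 hash functions. It also reproduces the figure quoted in the TODO comment of create(): optimalNumOfBits(1000, 1e-16) comes out as 76680.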

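From the above we can also conclude:

For a given expected false positive rate p, the optimal m is linear in the expected number of elements n.
The optimal k in fact depends only on p, and on neither m nor n:

$$k = -\frac{\ln p}{\ln 2} = -\log_2 p$$

So when creating a BloomFilter, choosing suitable values for p and n is important.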
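
The two conclusions can be checked numerically. The sketch below (JDK only; the class name BloomPerElement and the sample p values are assumptions for illustration) computes the bits needed per element, m / n = -ln(p) / (ln 2)^2, and the unrounded optimal k = -log2(p). Neither function takes n as an argument, which is exactly the point.

```java
// Standalone sketch: per-element cost and optimal k as functions of p alone.
// BloomPerElement is a hypothetical class name, not part of Guava.
final class BloomPerElement {

    // Bits per expected element: m / n = -ln(p) / (ln 2)^2
    static double bitsPerElement(double p) {
        return -Math.log(p) / (Math.log(2) * Math.log(2));
    }

    // Optimal number of hash functions before rounding: k = -log2(p)
    static double optimalK(double p) {
        return -Math.log(p) / Math.log(2);
    }

    public static void main(String[] args) {
        // Hypothetical target false positive rates
        for (double p : new double[] {0.1, 0.03, 0.01, 0.001}) {
            System.out.printf("p=%.3f  bits/element=%.2f  k=%.2f%n",
                    p, bitsPerElement(p), optimalK(p));
        }
    }
}
```

For p = 0.01 this gives about 9.59 bits per element and k ≈ 6.64, which rounds to 7. Guava itself rounds after computing m, so its result can differ from rounding -log2(p) directly in borderline cases.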

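Hashing strategies

The BloomFilterStrategies enum defines two hashing strategies, both built on the well-known MurmurHash algorithm: MURMUR128_MITZ_32 and MURMUR128_MITZ_64. The former is a simplified version, so let's look at how the latter is implemented.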

  MURMUR128_MITZ_64() {
    @Override
    public <T> boolean put(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      boolean bitsChanged = false;
      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
        combinedHash += hash2;
      }
      return bitsChanged;
    }

    @Override
    public <T> boolean mightContain(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
          return false;
        }
        combinedHash += hash2;
      }
      return true;
    }

    private /* static */ long lowerEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
    }

    private /* static */ long upperEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
    }
  };

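Here put() inserts an element into the Bloom filter, and mightContain() tests whether an element may be present. Let's walk through the flow using put() as the example.

Hash the data that the funnel extracts from the object with MurmurHash (murmur3_128), which yields a 128-bit (16-byte) array.
Take the low 8 bytes as the first hash value hash1 and the high 8 bytes as the second hash value hash2.
Loop k times. In each iteration, take the combined hash built from hash1 and hash2, reduce it modulo m, and set the corresponding bit in the bit array to 1.

Two points are worth noting here.

First, the loop applies the idea of double hashing: two hash functions are enough to simulate k of them, where i is the iteration index and hash2 acts as the step size:

$$g_i(x) = h_1(x) + i \cdot h_2(x), \quad i = 0, 1, \dots, k-1$$

The same technique is often used in open-addressing hash tables to reduce collisions.

Second, the hash value may be negative, and a negative number cannot be used as a position in the bit array. The value is therefore combined with Long.MAX_VALUE using a bitwise AND, which clears the highest bit (the sign bit) and makes it non-negative.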
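
The index computation can be isolated from the rest of the strategy. The following sketch (JDK only) takes two already computed 64-bit hashes and returns the k bit positions that the loop would touch. The class name DoubleHashingSketch and the sample inputs are assumptions for illustration; in Guava the two hashes come from the 128-bit murmur3 output.

```java
import java.util.Arrays;

// Standalone sketch of the index derivation used in put() and mightContain().
// DoubleHashingSketch is a hypothetical class name, not part of Guava.
final class DoubleHashingSketch {

    // Returns the k bit indexes derived from two base hashes for a bit array of m bits.
    static long[] indexes(long hash1, long hash2, int k, long m) {
        long[] result = new long[k];
        long combinedHash = hash1;
        for (int i = 0; i < k; i++) {
            // Clear the sign bit so the value is non-negative, then reduce modulo m.
            result[i] = (combinedHash & Long.MAX_VALUE) % m;
            // After i additions the value is hash1 + i * hash2 (wrapping on overflow).
            combinedHash += hash2;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(indexes(10L, 7L, 4, 64L)));  // [10, 17, 24, 31]
        System.out.println(Arrays.toString(indexes(-1L, 3L, 3, 100L))); // [7, 2, 5]
    }
}
```

The second call shows the sign-bit mask at work: -1 has all bits set, so masking it leaves Long.MAX_VALUE, whose remainder modulo 100 is 7.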

Bit array implementation

Here is part of the code of the LockFreeBitArray class.

  static final class LockFreeBitArray {
    private static final int LONG_ADDRESSABLE_BITS = 6;
    final AtomicLongArray data;
    private final LongAddable bitCount;

    LockFreeBitArray(long bits) {
      this(new long[Ints.checkedCast(LongMath.divide(bits, 64, RoundingMode.CEILING))]);
    }

    // Used by serialization
    LockFreeBitArray(long[] data) {
      checkArgument(data.length > 0, "data length is zero!");
      this.data = new AtomicLongArray(data);
      this.bitCount = LongAddables.create();
      long bitCount = 0;
      for (long value : data) {
        bitCount += Long.bitCount(value);
      }
      this.bitCount.add(bitCount);
    }

    /** Returns true if the bit changed value. */
    boolean set(long bitIndex) {
      if (get(bitIndex)) {
        return false;
      }

      int longIndex = (int) (bitIndex >>> LONG_ADDRESSABLE_BITS);
      long mask = 1L << bitIndex; // only cares about low 6 bits of bitIndex

      long oldValue;
      long newValue;
      do {
        oldValue = data.get(longIndex);
        newValue = oldValue | mask;
        if (oldValue == newValue) {
          return false;
        }
      } while (!data.compareAndSet(longIndex, oldValue, newValue));

      // We turned the bit on, so increment bitCount.
      bitCount.increment();
      return true;
    }

    boolean get(long bitIndex) {
      return (data.get((int) (bitIndex >>> 6)) & (1L << bitIndex)) != 0;
    }
    // ....
  }

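You can probably see why it is called a "LockFree" BitArray: it stores the bits in an AtomicLongArray and updates them with a compare-and-set loop, so no locking is needed. There is also a counter of type LongAddable, a Guava-internal type, that tracks how many bits are set to 1.

Besides the concurrency benefit, and more importantly, backing the bits with an array of longs lets it represent very long bit arrays: up to 64 × Integer.MAX_VALUE bits, since the number of longs must fit in an int (see Ints.checkedCast in the constructor). Each long holds 64 bits, so data[0] covers bits 0 to 63, data[1] covers bits 64 to 127, data[2] covers bits 128 to 191, and so on. With this layout, an unsigned right shift of the bit index i by 6 gives the position in the data array, and the mask 1L << i selects the bit inside that long. For a long shift, Java uses only the low 6 bits of the shift distance, so 1L << i is the same as 1L << (i mod 64).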
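
The addressing scheme can be tried out in isolation. Below is a reduced sketch using only the JDK; the class name BitAddressingSketch and the helper names wordIndex and mask are assumptions for illustration, and the bit counter from the original class is left out.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Reduced sketch of the word/bit addressing used by LockFreeBitArray.
// BitAddressingSketch, wordIndex and mask are hypothetical names.
final class BitAddressingSketch {
    private final AtomicLongArray words;

    BitAddressingSketch(long numBits) {
        // Round up to a whole number of 64-bit words.
        words = new AtomicLongArray((int) ((numBits + 63) >>> 6));
    }

    // Which long holds the bit: bitIndex / 64.
    static int wordIndex(long bitIndex) {
        return (int) (bitIndex >>> 6);
    }

    // Which bit inside that long: only the low 6 bits of the shift distance are used.
    static long mask(long bitIndex) {
        return 1L << bitIndex;
    }

    boolean get(long bitIndex) {
        return (words.get(wordIndex(bitIndex)) & mask(bitIndex)) != 0;
    }

    // Returns true if this call turned the bit on.
    boolean set(long bitIndex) {
        int idx = wordIndex(bitIndex);
        long bit = mask(bitIndex);
        long oldValue;
        long newValue;
        do {
            oldValue = words.get(idx);
            newValue = oldValue | bit;
            if (oldValue == newValue) {
                return false; // another thread (or an earlier call) already set it
            }
        } while (!words.compareAndSet(idx, oldValue, newValue));
        return true;
    }
}
```

For example, bit 70 lives in word 1 (70 >>> 6) at position 6 (70 mod 64), so mask(70) equals 1L << 6. Setting bit 70 leaves bit 6 of word 0 untouched.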

One last aside: the code above uses Long.bitCount() to count the 1 bits in the binary representation of a long, which is one of the neatest bit tricks in the JDK:

 public static int bitCount(long i) {
    // HD, Figure 5-14
    i = i - ((i >>> 1) & 0x5555555555555555L);
    i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L);
    i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
    i = i + (i >>> 8);
    i = i + (i >>> 16);
    i = i + (i >>> 32);
    return (int)i & 0x7f;
 }
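
The trick works by summing bits in parallel: first within every 2-bit group, then 4-bit groups, then bytes, and finally across the bytes. A simple way to convince yourself that it is correct is to compare it with a naive loop. The sketch below does that; the class name BitCountSketch and the sample values are assumptions for illustration.

```java
// Standalone sketch: compare Long.bitCount() with a straightforward bit-by-bit count.
// BitCountSketch is a hypothetical class name.
final class BitCountSketch {

    // Reference implementation: test each of the 64 bits in turn.
    static int naiveBitCount(long value) {
        int count = 0;
        for (int i = 0; i < 64; i++) {
            count += (int) ((value >>> i) & 1L);
        }
        return count;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, -1L, 0xF0F0L, Long.MIN_VALUE};
        for (long s : samples) {
            System.out.println(Long.toBinaryString(s)
                    + " -> " + Long.bitCount(s) + " / " + naiveBitCount(s));
        }
    }
}
```

Both counts agree on every sample; the naive version needs 64 iterations, while Long.bitCount() uses a fixed handful of shifts, masks and additions.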

Source: https://www.cnblogs.com/duanxz/p/14699028.html
