Too Big to Fit in a Set

Every now and then you need to count the unique elements in a data set.

In my case it was for collecting metrics. As a simple example, suppose we want the number of unique users who access a service in one day.

The first approach that comes to mind is to put every user's id into a Set. Assume here that a user id is an int64.

Suppose the actual number of unique users is 100 million.

100,000,000 * 8 bytes = 0.8 GB

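That is not an enormous amount, but spending 800 MB of server memory on a single metric is clearly wasteful, and the figure covers only the raw ids, before any overhead of the Set itself.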

And as the number of such metrics grows to 10, 20, and beyond, the server will find it harder and harder to cope.
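To make the arithmetic above concrete, here is a rough sizing sketch. The class and method names are made up for illustration, and it assumes only the raw 8-byte ids are stored, so a real Set would need more than this.

```java
// Back-of-the-envelope sizing for exact unique counting with raw int64 ids.
// Hypothetical helper for illustration; it ignores the Set's own overhead.
final class ExactCountMemory {
    static final long BYTES_PER_ID = Long.BYTES; // 8 bytes for an int64

    // Raw payload only: ids * 8 bytes * number of metrics tracked this way.
    static long rawBytes(long uniqueIds, int metrics) {
        return uniqueIds * BYTES_PER_ID * metrics;
    }

    public static void main(String[] args) {
        long users = 100_000_000L; // 100 million unique users, as in the text
        for (int metrics : new int[] {1, 10, 20}) {
            double gigabytes = rawBytes(users, metrics) / 1e9;
            System.out.println(metrics + " metric(s): " + gigabytes + " GB");
        }
    }
}
```

With 10 or 20 metrics of this size, the raw ids alone reach several gigabytes, which is the growth the paragraph above is worried about.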

Enter HyperLogLog

There is an algorithm that solves exactly this problem: HyperLogLog.

HyperLogLog is an algorithm that estimates the number of unique elements approximately while using only a small number of bytes.
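To give a feel for how so few bytes can be enough, here is a minimal sketch of the idea. It is my own illustration, not the implementation of any particular library: the class name TinyHll, the SplitMix64-style mixer used as the hash, and the one-byte-per-register layout are all assumptions. Each value is hashed, the top bits of the hash pick a register, and the register remembers the longest run of leading zeros seen in the remaining bits. The estimate is derived from a harmonic mean over all registers.

```java
// Minimal HyperLogLog sketch for illustration only (not a library implementation).
final class TinyHll {
    private final int p;            // number of index bits
    private final int m;            // number of registers = 2^p
    private final byte[] registers; // each register keeps the largest rank seen

    TinyHll(int p) {
        if (p < 7 || p > 16) throw new IllegalArgumentException("p must be in [7, 16]");
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    // SplitMix64-style mixer, used here as a stand-in 64-bit hash (an assumption of this sketch).
    static long mix(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    void add(long value) {
        long h = mix(value);
        int index = (int) (h >>> (64 - p)); // top p bits select a register
        long rest = h << p;                 // remaining 64 - p bits, left-aligned
        // Rank = position of the first 1 bit in the remaining bits (1-based).
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - p) + 1;
        if (rank > registers[index]) registers[index] = (byte) rank;
    }

    long estimate() {
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.scalb(1.0, -r); // 2^(-r)
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m); // bias constant, valid for m >= 128
        double raw = alpha * m * (double) m / sum;
        // Small-range correction: use linear counting while some registers are still empty.
        if (raw <= 2.5 * m && zeros > 0) {
            return Math.round(m * Math.log((double) m / zeros));
        }
        return Math.round(raw);
    }

    public static void main(String[] args) {
        TinyHll hll = new TinyHll(14);
        for (long id = 0; id < 1_000_000L; id++) hll.add(id);
        System.out.println("estimate = " + hll.estimate() + " (actual 1000000)");
    }
}
```

The state never grows with the number of elements added: with p = 14 this sketch always holds 16,384 one-byte registers, and adding the same value again cannot change any register.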

The table below compares the memory used to count the 67,801 distinct words that appear across the complete works of Shakespeare.

The error is small, while the difference in memory is dramatic.

Method	Memory used (bytes)	Unique words (result)	Relative error
HashSet	10,447,016	67,801	0%
HyperLogLog	512	70,002	3%

Source: https://highscalability.com/big-data-counting-how-to-count-a-billion-distinct-objects-us
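The two figures worth taking from the table can be recomputed from its raw numbers. The sketch below does only that; the class and method names are made up for illustration.

```java
// Recomputes the relative error and the memory ratio from the table's numbers.
final class ShakespeareComparison {
    // Relative error = |estimate - actual| / actual.
    static double relativeError(long estimate, long actual) {
        return Math.abs(estimate - actual) / (double) actual;
    }

    // How many times more memory the exact approach used.
    static double memoryRatio(long exactBytes, long sketchBytes) {
        return (double) exactBytes / sketchBytes;
    }

    public static void main(String[] args) {
        long actual = 67_801L;           // HashSet result (exact)
        long estimate = 70_002L;         // HyperLogLog result
        long hashSetBytes = 10_447_016L; // memory used by the HashSet
        long hllBytes = 512L;            // memory used by HyperLogLog

        // Roughly 3.2%, which the table rounds to 3%.
        System.out.println("relative error: " + relativeError(estimate, actual));
        // Roughly 20,400 times less memory for HyperLogLog.
        System.out.println("memory ratio: " + memoryRatio(hashSetBytes, hllBytes));
    }
}
```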

Trying HyperLogLog Out

Conveniently, there is a Java library that implements HyperLogLog, so I ran a quick performance test with it.

The code I used is below.

import java.util.stream.LongStream;
import net.agkn.hll.HLL;
// Guava's hashing classes as repackaged (shaded) by Apache Curator;
// plain Guava offers the same classes under com.google.common.hash.
import org.apache.curator.shaded.com.google.common.hash.HashFunction;
import org.apache.curator.shaded.com.google.common.hash.Hashing;
import org.junit.jupiter.api.Test;
import org.openjdk.jol.info.GraphLayout;

class HyperLogLogTest {

    @Test
    void test() {
        HashFunction hashFunction = Hashing.murmur3_128();
        long numberOfElements = 100_000_000; // 100 million
        HLL hll = new HLL(14, 5); // log2m = 14, regwidth = 5
        // Memory footprint of the empty HLL, measured with JOL.
        System.out.println("hll size before add data: " + GraphLayout.parseInstance(hll).toFootprint());
        var totalStartTime = System.nanoTime();
        LongStream.range(0, numberOfElements).forEach(element -> {
            var startTime = System.nanoTime();
            // Hash the element to 64 bits and feed the raw hash to the HLL.
            long hashedValue = hashFunction.newHasher().putLong(element).hash().asLong();
            hll.addRaw(hashedValue);
            // Print the time of a single insertion at a few checkpoints.
            if (element == 0 || element == 25_000_000 || element == 50_000_000 || element == 75_000_000 || element == 99_999_999) {
                System.out.println("element: {" + element + "}");
                System.out.println("time: " + ((System.nanoTime() - startTime) / 1_000_000.0) + "ms");
            }
        });
        System.out.println();
        System.out.println("total time: " + ((System.nanoTime() - totalStartTime) / 1_000_000.0) + "ms");
        // Memory footprint after all insertions, followed by the estimated cardinality.
        System.out.println("hll size after add data: " + GraphLayout.parseInstance(hll).toFootprint());
        System.out.println("hll.cardinality() = " + hll.cardinality());
    }
}

Hash function used: Murmur3_128
Test environment: Apple M1 Max, 32 GB RAM
HLL settings: log2m: 14, regwidth: 5
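As a rough guide to what these settings imply, the sketch below works out the register count, the register storage, and the theoretical standard error. It assumes log2m is the base-2 logarithm of the number of registers, regwidth is the number of bits per register, and the registers are densely packed. The class name is made up, and the footprint JOL reports will be somewhat larger because of object overhead.

```java
// Sizing sketch for an HLL configured with log2m and regwidth (illustrative helper).
final class HllSizing {
    // Number of registers m = 2^log2m.
    static long registerCount(int log2m) {
        return 1L << log2m;
    }

    // Bytes needed for m registers of regwidth bits each, densely packed.
    static long registerBytes(int log2m, int regwidth) {
        return (registerCount(log2m) * regwidth + 7) / 8;
    }

    // Theoretical standard error of HyperLogLog: about 1.04 / sqrt(m).
    static double standardError(int log2m) {
        return 1.04 / Math.sqrt(registerCount(log2m));
    }

    public static void main(String[] args) {
        int log2m = 14;
        int regwidth = 5;
        System.out.println("registers: " + registerCount(log2m));                     // 16384
        System.out.println("register bytes: " + registerBytes(log2m, regwidth));      // 10240
        System.out.println("standard error: " + standardError(log2m));                // 0.008125
    }
}
```

So with these settings the register storage is about 10 KB regardless of how many elements are added, and an error of a little under 1% is what the theory leads one to expect.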

The time taken to insert 100 million distinct numbers and the memory usage of the HLL were as follows.