Kafka Parameter Tuning in Our Project

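1. Background

As the business grew, the everyday rate at which our system reads messages from Kafka rose to 80,000 messages/second, and the on-the-hour peak rose to 200,000 messages/second. CPU usage climbed along with it. Without optimizing the system, the only remaining option would have been to add machines (the current cluster is 15 machines with 4 cores and 8 GB of memory each).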

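Our first idea was to use flame graphs to find where the CPU time was going. We expected to find room for optimization in the business code, but the flame graph showed that most of the CPU was spent in the Consumer's poll operation. So the Kafka Consumer was the place to start.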

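That led to the article "Kafka Tuning and Detailed Parameter Reference".

After writing "Kafka Tuning and Detailed Parameter Reference", we had a rough idea of the direction: to increase throughput we need to raise fetch.min.bytes, at the cost of added latency. But we did not know how many messages or how many bytes each poll actually fetched. In short, we lacked monitoring.

That led to a second article, "Kafka Business Monitoring".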

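2. Tuning

The monitoring data confirmed that each poll fetched only 1 message, about 400 bytes in size, with an average latency of roughly 10 ms. Because each fetch is so small and the latency is so low, raising fetch.min.bytes should be safe.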

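2.1 Tuning parameters

We set fetch.min.bytes to 1024. When adjusting this parameter, fetch.max.wait.ms (named fetch-max-wait in Spring Boot's spring.kafka.consumer properties) has to be considered as well: if incoming messages take a long time to reach fetch.min.bytes, the broker holds the fetch request, and the consumer keeps waiting for it, until fetch.max.wait.ms has elapsed.

In our case the message volume is very high, so it almost never takes long to reach fetch.min.bytes. Even if that did happen in an extreme case, the default fetch.max.wait.ms of 500 ms is a delay the business can fully tolerate.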
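As a minimal sketch of the two settings discussed above, the following builds the consumer properties using only java.util.Properties and the literal config keys. The class name FetchTuningConfig, the broker address localhost:9092, the group id demo-group and the choice of String deserializers are placeholders for illustration, not values from the project. In a real application the resulting Properties would be passed to the KafkaConsumer constructor.

```java
import java.util.Properties;

// Hypothetical helper for illustration; names and placeholder values are assumptions.
final class FetchTuningConfig {

    static Properties tunedConsumerProps(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Assumed deserializers; the article does not say which ones the project uses.
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Ask the broker to hold each fetch until at least 1024 bytes are available (default is 1).
        props.setProperty("fetch.min.bytes", "1024");
        // Upper bound on that wait; 500 ms is the Kafka default, set explicitly for clarity.
        props.setProperty("fetch.max.wait.ms", "500");
        return props;
    }

    public static void main(String[] args) {
        Properties props = tunedConsumerProps("localhost:9092", "demo-group");
        System.out.println("fetch.min.bytes   = " + props.getProperty("fetch.min.bytes"));
        System.out.println("fetch.max.wait.ms = " + props.getProperty("fetch.max.wait.ms"));
    }
}
```

Setting fetch.max.wait.ms explicitly is optional here, since it matches the default, but it keeps the two related values side by side when they are tuned later.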

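Below is a recap of what the two parameters mean; you can also go straight to "Kafka Tuning and Detailed Parameter Reference".

The comments are taken from the Kafka source code, in the class org.apache.kafka.clients.consumer.ConsumerConfig.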

/**
 * <code>fetch.min.bytes</code>
 */
public static final String FETCH_MIN_BYTES_CONFIG = "fetch.min.bytes";
private static final String FETCH_MIN_BYTES_DOC = "The minimum amount of data the server should return for a fetch request. If insufficient data is available the request will wait for that much data to accumulate before answering the request. The default setting of 1 byte means that fetch requests are answered as soon as a single byte of data is available or the fetch request times out waiting for data to arrive. Setting this to something greater than 1 will cause the server to wait for larger amounts of data to accumulate which can improve server throughput a bit at the cost of some additional latency.";

Here, FETCH_MIN_BYTES_DOC is the description of the fetch.min.bytes setting.

/**
 * <code>fetch.max.wait.ms</code>
 */
public static final String FETCH_MAX_WAIT_MS_CONFIG = "fetch.max.wait.ms";
private static final String FETCH_MAX_WAIT_MS_DOC = "The maximum amount of time the server will block before answering the fetch request if there isn't sufficient data to immediately satisfy the requirement given by fetch.min.bytes.";

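This parameter defines the maximum time to wait when fetching data, which keeps consumption latency from growing too high. It is generally used together with fetch.min.bytes.

When fetch.min.bytes is 1 (the default), fetch.max.wait.ms only comes into play when no data is available at all, because the broker answers as soon as a single byte is available.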
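The interaction between the two parameters can be illustrated with a rough model: the broker answers a fetch as soon as fetch.min.bytes of data has accumulated, or when fetch.max.wait.ms elapses, whichever comes first. The sketch below is not Kafka's implementation. FetchWaitModel is a hypothetical class that assumes data arrives at a steady rate, and the 5,000 messages/second per-consumer rate in main is an assumed figure chosen only for the arithmetic; the 400-byte message size is the one observed in monitoring.

```java
// Simplified, hypothetical model of how long a fetch request is held; not Kafka code.
final class FetchWaitModel {

    /**
     * Estimates how long (in ms) the broker holds a fetch request, assuming
     * data arrives at a constant rate of bytesPerMs.
     */
    static double estimatedWaitMs(int fetchMinBytes, int fetchMaxWaitMs,
                                  long bytesAlreadyAvailable, double bytesPerMs) {
        if (bytesAlreadyAvailable >= fetchMinBytes) {
            return 0.0; // enough data is already there: answer immediately
        }
        if (bytesPerMs <= 0.0) {
            return fetchMaxWaitMs; // nothing arrives: the request times out
        }
        double msToFill = (fetchMinBytes - bytesAlreadyAvailable) / bytesPerMs;
        return Math.min(msToFill, fetchMaxWaitMs);
    }

    public static void main(String[] args) {
        // Assumed load: 5,000 messages/s of 400 bytes each = 2,000 bytes/ms.
        double bytesPerMs = 5_000 * 400 / 1000.0;

        // Tuned setting under load: 1024 bytes accumulate well before the 500 ms cap.
        System.out.println(estimatedWaitMs(1024, 500, 0, bytesPerMs)); // 0.512

        // Default fetch.min.bytes = 1 with one 400-byte message waiting: no wait.
        System.out.println(estimatedWaitMs(1, 500, 400, bytesPerMs)); // 0.0

        // Idle topic: the request is held for the full fetch.max.wait.ms.
        System.out.println(estimatedWaitMs(1024, 500, 0, 0.0)); // 500.0
    }
}
```

Under these assumptions the extra wait introduced by fetch.min.bytes = 1024 stays well below a millisecond, and the 500 ms cap only matters when traffic stops.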

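2.2 Tuning results

Each poll fetches more messages and more data
The average interval between polls increased
The average consumption wait time increased
CPU usage dropped by about 25%

Summary: this change trades timeliness for lower CPU usage, so it suits business scenarios without strict timeliness requirements. How much timeliness is lost can be checked on the Kafka monitoring dashboard, through metrics such as average/maximum consumption wait time and average/maximum fetch latency.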

3. References

1. Geek Time column "Kafka Core Technology and Practice" (《Kafka核心技术与实战》)

2. Official documentation: https://kafka.apache.org/documentation/#consumerconfigs