Kafka01
# 1.1 Introduction
Event streaming is the digital equivalent of the human body's central nervous system.
It is the technological foundation for the 'always-on' world where businesses are increasingly software-defined and automated, and where the user of software is more software.
Technically speaking, event streaming is the practice of capturing data in real-time from event sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events; storing these event streams durably for later retrieval; manipulating, processing, and reacting to the event streams in real-time as well as retrospectively; and routing the event streams to different destination technologies as needed. Event streaming thus ensures a continuous flow and interpretation of data so that the right information is at the right place, at the right time.
- later retrieval: events remain stored after consumption, so they can still be read later
- event: as in event streams, big-data streams, and stream-processing platforms
Event streaming is applied to a wide variety of use cases across a plethora of industries and organizations. Its many examples include:
- To process payments and financial transactions in real-time, such as in stock exchanges, banks, and insurances.
- To track and monitor cars, trucks, fleets, and shipments in real-time, such as in logistics and the automotive industry.
- To continuously capture and analyze sensor data from IoT devices or other equipment, such as in factories and wind parks.
- To collect and immediately react to customer interactions and orders, such as in retail, the hotel and travel industry, and mobile applications.
- To monitor patients in hospital care and predict changes in condition to ensure timely treatment in emergencies.
- To connect, store, and make available data produced by different divisions of a company.
- To serve as the foundation for data platforms, event-driven architectures, and microservices.
- plethora /ˈplɛθərə/: as in "a plethora of industries"
- logistics /ləˈdʒɪstɪks/: as in "in logistics"
- divisions: as in "different divisions of a company"
# Apache Kafka® is an event streaming platform. What does that mean?
Kafka combines three key capabilities so you can implement your use cases for event streaming end-to-end with a single battle-tested solution:
- To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.
- To store streams of events durably and reliably for as long as you want.
- To process streams of events as they occur or retrospectively.
And all this functionality is provided in a distributed, highly scalable, elastic, fault-tolerant, and secure manner. Kafka can be deployed on bare-metal hardware, virtual machines, and containers, and on-premises as well as in the cloud. You can choose between self-managing your Kafka environments and using fully managed services offered by a variety of vendors.
- a variety of vendors: many different providers to choose from
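To make the first capability concrete, here is a minimal sketch of publishing an event with the standard Java producer client; the broker address, topic name, key, and value are assumptions made up for the example:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PaymentsProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed local broker; point this at your own cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish (write) one event to the hypothetical "payments" topic.
            producer.send(new ProducerRecord<>("payments", "alice", "made a payment of $200 to bob"));
        }
    }
}
```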
# How does Kafka work in a nutshell? ※※※※
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers in on-premise as well as cloud environments.
Servers: Kafka is run as a cluster of one or more servers that can span multiple datacenters or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams to integrate Kafka with your existing systems such as relational databases as well as other Kafka clusters. To let you implement mission-critical use cases, a Kafka cluster is highly scalable and fault-tolerant: if any of its servers fails, the other servers will take over their work to ensure continuous operations without any data loss.
Clients: They allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures. Kafka ships with some such clients included, which are augmented by dozens of clients provided by the Kafka community: clients are available for Java and Scala including the higher-level Kafka Streams library, for Go, Python, C/C++, and many other programming languages as well as REST APIs.
- span: to stretch across; e.g. "span multiple datacenters"
- ships with: comes bundled with
# Main Concepts and Terminology
An event records the fact that "something happened" in the world or in your business. It is also called record or message in the documentation. When you read or write data to Kafka, you do this in the form of events. Conceptually, an event has a key, value, timestamp, and optional metadata headers. Here's an example event:
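For illustration, an example event (values made up, in the spirit of the official documentation) might look like:

- Event key: "Alice"
- Event value: "Made a payment of $200 to Bob"
- Event timestamp: "Jun. 25, 2020 at 2:06 p.m."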
Producers are those client applications that publish (write) events to Kafka, and consumers are those that subscribe to (read and process) these events. In Kafka, producers and consumers are fully decoupled and agnostic of each other, which is a key design element to achieve the high scalability that Kafka is known for. For example, producers never need to wait for consumers. Kafka provides various guarantees such as the ability to process events exactly-once.
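A matching sketch of the consumer side, again with an assumed broker address, consumer group id, and topic name; note that it runs completely independently of the producer sketch above:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PaymentsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed local broker
        props.put("group.id", "payments-readers");          // hypothetical consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));
            while (true) {
                // Poll for new events; the producer never waits for this loop.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s partition=%d offset=%d%n",
                            record.key(), record.value(), record.partition(), record.offset());
                }
            }
        }
    }
}
```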
Events are organized and durably stored in topics. Very simplified, a topic is similar to a folder in a filesystem, and the events are the files in that folder. An example topic name could be "payments". Topics in Kafka are always multi-producer and multi-subscriber: a topic can have zero, one, or many producers that write events to it, as well as zero, one, or many consumers that subscribe to these events. Events in a topic can be read as often as needed—unlike traditional messaging systems, events are not deleted after consumption. Instead, you define for how long Kafka should retain your events through a per-topic configuration setting, after which old events will be discarded. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is perfectly fine.
Topics are partitioned, meaning a topic is spread over a number of "buckets" located on different Kafka brokers. This distributed placement of your data is very important for scalability because it allows client applications to both read and write the data from/to many brokers at the same time. When a new event is published to a topic, it is actually appended to one of the topic's partitions. Events with the same event key (e.g., a customer or vehicle ID) are written to the same partition, and Kafka guarantees that any consumer of a given topic-partition will always read that partition's events in exactly the same order as they were written.
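A small sketch of that keying behavior, assuming a producer configured as in the earlier example and the same hypothetical "payments" topic; the RecordMetadata returned for each send shows that records with the same key land on the same partition:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class KeyedEvents {
    // Assumes a producer configured as in the earlier producer sketch.
    static void sendKeyed(KafkaProducer<String, String> producer) throws Exception {
        // Two events for the same (hypothetical) customer key.
        RecordMetadata first  = producer.send(new ProducerRecord<>("payments", "customer-42", "order created")).get();
        RecordMetadata second = producer.send(new ProducerRecord<>("payments", "customer-42", "order paid")).get();

        // Same key, therefore same partition, therefore read back in this order.
        System.out.println(first.partition() == second.partition()); // prints true
    }
}
```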
To make your data fault-tolerant and highly-available, every topic can be replicated, even across geo-regions or datacenters, so that there are always multiple brokers that have a copy of the data just in case things go wrong, you want to do maintenance on the brokers, and so on. A common production setting is a replication factor of 3, i.e., there will always be three copies of your data. This replication is performed at the level of topic-partitions.
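The replication factor is chosen when a topic is created. A sketch with the Java Admin client, assuming a cluster of at least three brokers and the same hypothetical topic name:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // 3 partitions, replication factor 3: every partition is copied to three brokers.
            NewTopic topic = new NewTopic("payments", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```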
This primer should be sufficient for an introduction. The Design section of the documentation explains Kafka's various concepts in full detail, if you are interested.
# 1.2 Use Cases
Here is a description of a few of the popular use cases for Apache Kafka®. For an overview of a number of these areas in action, see this blog post.
# Messaging
Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc.). In comparison to most messaging systems, Kafka has better throughput, built-in partitioning, replication, and fault-tolerance, which makes it a good solution for large-scale message processing applications.
In our experience messaging uses are often comparatively low-throughput, but may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides.
In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.
# Website Activity Tracking
This passage describes Kafka's original purpose: recording users' site activity and publishing it to central topics. It outlines how the resulting data streams are used and what characterizes activity tracking, highlighting Kafka's strengths in handling real-time streams and large data volumes.
The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.
- Subject: The original use case for Kafka
- Predicate: was to be able to rebuild
- Phrase: rebuild ... as
- a user activity tracking pipeline
- a set of real-time publish-subscribe feeds.
These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.
- Subject: These feeds
- Verb: are available for subscription
- Object: for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.
- warehousing: technical term, as in "data warehousing"
Activity tracking is often very high volume as many activity messages are generated for each user page view.
- Subject: Activity tracking
- Verb: is often
- Object: very high volume as many activity messages are generated for each user page view.

Explanation: activity tracking typically involves a very large data volume, because each page view a user makes generates many activity messages.
# Metrics
Kafka is often used for operational monitoring data.
This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
# Log Aggregation
Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing.
Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages.
This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption.
In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
# Stream Processing
Many users of Kafka process data in processing pipelines consisting of multiple stages, where raw input data is consumed from Kafka topics and then aggregated, enriched, or otherwise transformed into new topics for further consumption or follow-up processing.
- For example, a processing pipeline for recommending news articles might crawl article content from RSS feeds and publish it to an "articles" topic; further processing might normalize or deduplicate this content and publish the cleansed article content to a new topic; a final processing stage might attempt to recommend this content to users.
- Such processing pipelines create graphs of real-time data flows based on the individual topics.
- Starting in 0.10.0.0, a light-weight but powerful stream processing library called Kafka Streams is available in Apache Kafka to perform such data processing as described above.
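As a rough illustration, here is a sketch of one such pipeline stage using the Kafka Streams library mentioned above; the topic names and the trivial "cleansing" step are assumptions made up for the example:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ArticleCleanser {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "article-cleanser");  // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read raw articles, apply a trivial "cleansing" transformation, write to a new topic.
        KStream<String, String> articles = builder.stream("articles");
        articles.mapValues(text -> text.trim().toLowerCase())
                .to("articles-cleansed");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```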
Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.
# Event Sourcing
- Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records.
- Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.
# Commit Log
Kafka can serve as a kind of external commit-log for a distributed system.
- The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.
- The log compaction feature in Kafka helps support this usage.
- In this usage Kafka is similar to the Apache BookKeeper project.
# Upgrading KRaft-based clusters
If you are upgrading from a version prior to 3.3.0, please see the note in step 3 below. Once you have changed the metadata.version to the latest version, it will not be possible to downgrade to a version prior to 3.3-IV0.
For a rolling upgrade:

1. Upgrade the brokers one at a time: shut down the broker, update the code, and restart it. Once you have done so, the brokers will be running the latest version and you can verify that the cluster's behavior and performance meets expectations.
   - broker: in computer science and IT, this usually refers to the intermediary or agent process in messaging middleware.
2. Once the cluster's behavior and performance has been verified, bump the metadata.version by running ./bin/kafka-features.sh upgrade --metadata 3.5
- "bump" 的意思是增加或升级;一般是撞击
3. Note that cluster metadata downgrade is not supported in this version since it has metadata changes. Every MetadataVersion after 3.2.x has a boolean parameter that indicates if there are metadata changes (i.e. IBP_3_3_IV3(7, "3.3", "IV3", true) means this version has metadata changes). Given your current and target versions, a downgrade is only possible if there are no metadata changes in the versions between.
# Upgrading to 3.1.0 from any version 0.8.x through 3.0.x
If you are upgrading from a version prior to 2.1.x, please see the note below about the change to the schema used to store consumer offsets.
Once you have changed the inter.broker.protocol.version to the latest version, it will not be possible to downgrade to a version prior to 2.1.
For a rolling upgrade:

1. Update server.properties on all brokers and add the following properties. CURRENT_KAFKA_VERSION refers to the version you are upgrading from. CURRENT_MESSAGE_FORMAT_VERSION refers to the message format version currently in use. If you have previously overridden the message format version, you should keep its current value. Alternatively, if you are upgrading from a version prior to 0.11.0.x, then CURRENT_MESSAGE_FORMAT_VERSION should be set to match CURRENT_KAFKA_VERSION.
   - inter.broker.protocol.version=CURRENT_KAFKA_VERSION (e.g. 3.0, 2.8, etc.)
   - log.message.format.version=CURRENT_MESSAGE_FORMAT_VERSION (See potential performance impact following the upgrade for the details on what this configuration does.)

   If you are upgrading from version 0.11.0.x or above, and you have not overridden the message format, then you only need to override the inter-broker protocol version:
   - inter.broker.protocol.version=CURRENT_KAFKA_VERSION (e.g. 3.0, 2.8, etc.)
2. Upgrade the brokers one at a time: shut down the broker, update the code, and restart it. Once you have done so, the brokers will be running the latest version and you can verify that the cluster's behavior and performance meets expectations. It is still possible to downgrade at this point if there are any problems.
3. Once the cluster's behavior and performance has been verified, bump the protocol version by editing inter.broker.protocol.version and setting it to 3.1.
4. Restart the brokers one by one for the new protocol version to take effect. Once the brokers begin using the latest protocol version, it will no longer be possible to downgrade the cluster to an older version.
5. If you have overridden the message format version as instructed above, then you need to do one more rolling restart to upgrade it to its latest version. Once all (or most) consumers have been upgraded to 0.11.0 or later, change log.message.format.version to 3.1 on each broker and restart them one by one. Note that the older Scala clients, which are no longer maintained, do not support the message format introduced in 0.11, so to avoid conversion costs (or to take advantage of exactly once semantics), the newer Java clients must be used.