Kafka Çorbası: Apache Pulsar

Introduction

Note: For Java usage, see the Pulsar API post.

What Is Apache Pulsar?

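The explanation is as follows. Its biggest strength is that it scales very well. Pulsar can also serve Kafka clients, either through a Kafka client wrapper or through the Kafka-on-Pulsar protocol handler, both covered in the migration section of this post.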

Apache Pulsar is a cloud-native, multi-tenant, high-performance solution for server-to-server messaging and queuing built on the publisher-subscribe (pub-sub) pattern. Pulsar combines the best features of a traditional messaging system like RabbitMQ with those of a pub-sub system like Apache Kafka — scaling up or down dynamically without downtime. It’s used by thousands of companies for high-performance data pipelines, microservices, instant messaging, data integrations, and more.

Originally developed by Yahoo, Pulsar is under the stewardship of the Apache Software Foundation.

The explanation is as follows. Pulsar is a competitor to Kafka.

Apache Pulsar, which was originally developed at Yahoo! and donated to the Apache Software Foundation in 2016, was designed to take advantage of modern systems design while at the same time supporting traditional message exchange patterns. Pulsar is a distributed system that uses a log-structured storage architecture and is designed to provide high scalability and low latency.

What’s really revolutionary about Pulsar is that it was explicitly designed to handle the requirements of both modern streaming and the traditional broker on a single platform. This means that Pulsar has several features that will sound familiar to users of traditional message brokers, such as tracking of unacknowledged messages, redelivery of messages, and individual message acknowledgements. Pulsar supports these features at the same time providing the high performance and scalability expected of a modern streaming platform.

On top of all that, because Pulsar emerged from an internet-scale enterprise, it also supports enterprise-grade features such as multi-tenancy and geo-replication.

A survey from 2021 says the following

According to the community-led 2021 Pulsar user’s survey, the second-highest use of Apache Pulsar is within asynchronous architectures to support microservices-based applications.

Architecture

The explanation is as follows

Kafka is one distributed cluster (after removing the ZooKeeper dependency in 2022); Pulsar is three distributed clusters (Pulsar brokers, ZooKeeper, BookKeeper).

Pulsar Functions

The explanation is as follows

Pulsar Functions, a serverless-based compute framework that’s a part of Apache Pulsar. Pulsar Functions can be used to encapsulate atomic logic wrapping a database mutation with the publishing of a corresponding event as a part of a single “transaction”. In microservices parlance, this is known as a “local transaction” that guarantees atomicity for events that flow into microservices systems. Unlike CDC, Pulsar Functions require microservices developers to code logic to deliver atomicity guarantees. Pulsar Functions are a great option when complex logic is needed to support local transactions.
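To make the shape of such a function concrete, here is a minimal sketch using the language-native Python style, in which a module-level `process` function is called once per input message and its return value is what gets published to the output topic. The JSON event format, the field names, and the in-memory `ORDERS` dictionary are assumptions made for this example. The dictionary only stands in for a real database; it does not by itself give the atomicity guarantee the quote talks about, which is exactly the logic the developer would have to add.

```python
import json

# Hypothetical stand-in for a database table. A real function would write to an
# actual database and needs its own logic to keep the write and the event in sync.
ORDERS = {}


def process(input):
    """Called once per input message; the returned string is the output event."""
    order = json.loads(input)
    # 1. Apply the state change (the "database mutation" from the quote).
    ORDERS[order["id"]] = order["status"]
    # 2. Return the matching event so that it is published downstream.
    event = {"type": "order-updated", "id": order["id"], "status": order["status"]}
    return json.dumps(event)
```

Because `process` is an ordinary function, it can be exercised locally with a JSON string before it is deployed anywhere.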

Kafka's Shortcomings

They list the areas where Kafka falls short as follows. Apache Cassandra already provides these four capabilities, so the goal is to get the same capabilities from the message queue as well.

We started by evaluating the most popular option, Apache Kafka. We found that it came up short in four areas:
1. Geo-replication
2. Scaling
3. Multi-tenancy
4. Queuing

Apache Pulsar solves all of these problems to our satisfaction.

1. Geo-replication

The explanation is as follows

Cassandra supports synchronous and asynchronous replication within or across data centers. Most often, Cassandra is configured for synchronous replication within a region, and asynchronous replication across regions. This enables Cassandra users like Netflix to serve customers everywhere with local latency, to comply with data sovereignty regulations, and to survive infrastructure failures. (When AWS rebooted 218 Cassandra nodes to patch a security vulnerability, Netflix experienced zero downtime.)

Kafka is designed to run in a single region and does not support cross-datacenter replication. Clients outside the region where Kafka is deployed must simply tolerate the increased latency. There are several projects that attempt to add cross-datacenter replication to Kafka at the client level, but these are necessarily difficult to operate and prone to failure.

Like Cassandra, Pulsar builds geo-replication into the core server. Also like Cassandra, you can choose to deploy this in a synchronous or asynchronous configuration, and you can configure replication by topic. Producers can write to a shared topic from any region, and Pulsar takes care of ensuring those messages are visible to consumers everywhere.

2. Scaling

Schematically, Kafka looks like this

The explanation is as follows. In other words, the problem is that data is written to the leader broker and then read asynchronously by the follower brokers.

In Kafka, the client sends messages to the leader broker through the PRODUCE request. These messages are persisted to the node locally. Each follower reads the data from the leader through the FETCH request to store a copy of the messages. In this leader-follower architecture, each broker needs to handle both data processing and storage.

One disadvantage of the design is that it guarantees the most recent and relevant data replica is only stored on the leader broker, which serves both producers and consumers. This means a Kafka cluster can be overwhelmed during traffic bursts as the load may not be spread across followers.

Pulsar, on the other hand, looks like this

The explanation is as follows. In other words, there are two tiers. Whichever broker a write request reaches, that broker forwards it to the right bookies.

By contrast, Pulsar separates serving (brokers) and storage (bookies) into different layers. At the computing layer, all Pulsar brokers are stateless and equivalent to each other. As shown in the figure below, the client sends messages to the broker through the SEND request. After processing the messages, the broker delivers them to different bookies through the ADD_ENTRY request. Specifically, data are written to bookies based on the configured writing strategy (that is, the values of ensemble size, write quorum, and ack quorum). This helps achieve data high availability across storage nodes. Most importantly, both brokers and bookies can be easily scaled at their own layer, not impacting the other.

This cloud-native architecture of Pulsar features great scalability, availability, and resiliency, providing users with solutions to some key pain points in Kafka. Therefore, many users are looking for a graceful plan to migrate from Kafka to Pulsar.
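The three values named in the quote (ensemble size, write quorum, ack quorum) can be sketched as follows. This is a simplified model, not BookKeeper code: the `WritePolicy` class and the helper names are made up for this example, and the round-robin striping is an assumption about how entries are spread over the ensemble.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WritePolicy:
    ensemble_size: int  # bookies that a ledger's entries are spread over
    write_quorum: int   # copies written for each entry
    ack_quorum: int     # acknowledgements needed before a write counts as done

    def __post_init__(self):
        if not (self.ensemble_size >= self.write_quorum >= self.ack_quorum >= 1):
            raise ValueError("expected ensemble_size >= write_quorum >= ack_quorum >= 1")


def write_set(entry_id, ensemble, policy):
    """Bookies that receive one entry, striped round-robin over the ensemble."""
    start = entry_id % policy.ensemble_size
    return [ensemble[(start + i) % policy.ensemble_size]
            for i in range(policy.write_quorum)]


def is_acknowledged(acks_received, policy):
    """A write succeeds once enough bookies have confirmed it."""
    return acks_received >= policy.ack_quorum
```

With three bookies, a write quorum of two, and an ack quorum of two, consecutive entries land on different pairs of bookies, which is how the load spreads across storage nodes while each entry still has two copies.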

Bookie

The explanation is as follows

In Kafka, the unit of storage is a segment file, but the unit of replication is all the segment files in a partition. Each partition is owned by a single leader broker, which replicates to several followers. So when you need to add capacity to your Kafka cluster, some partitions have to be copied to the new node before it can participate in reducing the load on the existing nodes.

This means that adding capacity to a Kafka cluster makes it slower before it makes it faster. If your capacity planning is on point, then this is fine, but if business needs change faster than you expected then it could be a serious problem.

Pulsar adds a layer of indirection. (Pulsar also splits apart compute and storage, which are managed by the broker and the bookie, respectively, but the important part here is how Pulsar, via Bookkeeper, increases the granularity of replication.) In Pulsar, partitions are split up into ledgers, but unlike Kafka segments, ledgers can be replicated independently of one another. Pulsar keeps a map of which ledgers belong to a partition in Zookeeper. So when we add a new storage node to the cluster, all we have to do is start a new ledger on that node. Existing data can stay where it is—no extra work needs to be done by the cluster.
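The indirection described above can be illustrated with a toy model. The `Partition` class, its method names, and the bookie names are all hypothetical; the point is only that a partition is a list of ledgers, each with its own set of bookies, so starting a new ledger on a new node leaves older ledgers untouched.

```python
class Partition:
    """Toy model: a partition is an ordered list of ledgers."""

    def __init__(self, name, ensemble):
        self.name = name
        self.ledgers = []  # stands in for the ledger map kept in ZooKeeper
        self.open_ledger(ensemble)

    def open_ledger(self, ensemble):
        """Start a new ledger on the given bookies; older ledgers are not moved."""
        ledger = {"id": len(self.ledgers), "bookies": list(ensemble), "entries": []}
        self.ledgers.append(ledger)
        return ledger

    def append(self, message):
        """New messages always go to the most recent ledger."""
        self.ledgers[-1]["entries"].append(message)


partition = Partition("orders-0", ["bookie-1", "bookie-2"])
partition.append("m1")
partition.append("m2")

# A new storage node joins: only a new ledger is started, nothing is copied.
partition.open_ledger(["bookie-2", "bookie-3"])
partition.append("m3")
```

After the new ledger is opened, the first ledger still lives on `bookie-1` and `bookie-2` with its two entries, while new writes already use `bookie-3`.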

One of the most important pieces for scaling is Apache BookKeeper. Schematically, it looks like this

The explanation is as follows

Originally developed at Yahoo, BookKeeper represents a reliable, high-performance storage system. It provides distributed, scalable storage services, featuring low latency and strong fault tolerance. These speak volumes about why it is capable of serving as Pulsar’s storage layer. BookKeeper stores data in ledgers, which are append-only and immutable. With a special replication protocol, BookKeeper stores log entries securely across multiple nodes in a concurrent way, which are highly available.

As the names suggest, you probably can tell what BookKeeper and ledgers are used for in a cloud-native environment. If not, just imagine a bookkeeper using ledgers to record all relevant account information to track the finances of a business.

3. Multi-tenancy

The explanation is as follows

Multi-tenant infrastructure can be shared across multiple users and organizations while isolating them from each other. The activities of one tenant should not be able to affect the security or the SLAs of other tenants.

Fundamentally, multi-tenancy reduces costs in two ways. First, simply by sharing infrastructure that isn’t maxed out by a single tenant — the cost of that component can be amortized across all users. Second, by simplifying administration — when there are dozens or hundreds or thousands of tenants, managing a single instance offers significant simplification. Even in a containerized world, “get me an account on this shared system” is much easier to fulfill than “stand me up a new instance of this service.” And global problems may be obscured by being scattered across many instances.

Like geo-replication, multi-tenancy is hard to graft on to a system that wasn’t designed for it. Kafka is a single-tenant design, but Pulsar builds multi-tenancy in at the core.

Pulsar enables us to manage multiple tenants across multiple regions from a single interface that includes authentication and authorization, isolation policy (Pulsar can optionally carve out hardware within the cluster that is dedicated to a single tenant), and storage quotas.

For multi-tenancy, Pulsar offers the following structure.

By design, messages in Pulsar are published to topics, and topics are organized in a three-level hierarchy structure ...

With this structure:
One tenant represents a specific business unit or a product line. Topics created under a tenant share the same business context that is distinct from others.

Within one tenant, topics having similar behavioral characteristics can be further grouped into a smaller administrative unit called a namespace.

Different policies such as message retention or expiry policy can be set either at the namespace level or at an individual topic level. Policies set at the namespace level will be applicable to all topics under the namespace.
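A small sketch of how the two policy levels could be resolved for a given topic. The dictionaries, the policy keys, and the `effective_policy` helper are invented for this example, and the rule that a topic-level value wins over the namespace-level one is stated here as an assumption about how the two levels combine.

```python
# Hypothetical policy tables, keyed the same way the hierarchy is organized.
NAMESPACE_POLICIES = {
    ("sales", "orders"): {"retention_minutes": 60, "message_ttl_seconds": 3600},
}
TOPIC_POLICIES = {
    ("sales", "orders", "audit"): {"retention_minutes": 10080},
}


def effective_policy(tenant, namespace, topic, key):
    """Look at the topic level first, then fall back to the namespace level."""
    topic_level = TOPIC_POLICIES.get((tenant, namespace, topic), {})
    if key in topic_level:
        return topic_level[key]
    return NAMESPACE_POLICIES.get((tenant, namespace), {}).get(key)
```

In this model the `audit` topic keeps its own retention setting but still inherits the namespace's message TTL, while any other topic in `sales/orders` gets both values from the namespace.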

Schematically, it looks like this

This structure also shows up in the topic name string that clients connect with. The explanation is as follows

In Pulsar, the name of a topic actually reflects the following structure:
{persistent|non-persistent}://<tenant_name>/<namespace_name>/<topic_name>

When a client application (e.g. a producer or a consumer) connects to a Pulsar topic, it must specify the full string that contains the tenant and the namespace names. Otherwise, an error will be reported about invalid topic names.

Please note that in Pulsar a topic can be persistent (the topic_name starts with persistent:// prefix) or non-persistent (the topic_name starts with non-persistent:// prefix). In a non-persistent topic, messages are only stored in memory and not persisted to hard drives. Because non-persistent topics are used only rarely in some edge situations, we will not discuss them in this article. For the remainder of this article, I will refer to the persistent topic in Pulsar as simply “topic”.
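The topic name format above can be taken apart with a few lines of Python. This is a hypothetical helper written for this post, not part of any Pulsar client; it simply mirrors the `{persistent|non-persistent}://<tenant_name>/<namespace_name>/<topic_name>` layout and rejects anything that does not match it.

```python
import re
from typing import NamedTuple


class TopicName(NamedTuple):
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str


_TOPIC_RE = re.compile(r"^(persistent|non-persistent)://([^/]+)/([^/]+)/(.+)$")


def parse_topic_name(full_name):
    """Split a fully qualified topic name into its four parts."""
    match = _TOPIC_RE.match(full_name)
    if match is None:
        raise ValueError(f"invalid topic name: {full_name!r}")
    return TopicName(*match.groups())
```

For example, `parse_topic_name("persistent://sales/orders/created")` yields the tenant `sales`, the namespace `orders`, and the topic `created`.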

4. Queuing (as well as streaming)

The explanation is as follows

Kafka offers a classic pub/sub (publish/subscribe) messaging model — publishers send messages to Kafka, which orders them by partition within a topic, and sends a copy to every subscriber (or “consumer”).

Kafka records which messages a consumer has seen with an offset into the log. This means that messages cannot be acknowledged out-of-order, which in turn means that a subscription cannot be shared across multiple consumers. (Kafka enables mapping multiple partitions to a single consumer in its consumer group design, but not the other way around.)
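Why an offset rules out out-of-order acknowledgement can be seen in a small sketch. The function below is a simplified model and not Kafka client code: given the set of offsets that have actually been processed, it returns the only position that can safely be recorded, which is the first gap.

```python
def committable_offset(processed, start=0):
    """Return the next offset to read: everything below it has been processed."""
    offset = start
    while offset in processed:
        offset += 1
    return offset
```

If offsets 0, 1, 3 and 4 are done but 2 is still in flight, the recorded position can only be 2. The work done on 3 and 4 cannot be expressed by a single offset, so after a restart those messages would be read again.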

This is fine for pub/sub use cases, sometimes called streaming. For streaming, it’s important to consume messages in the same order in which they were published.

Pulsar supports the pub/sub model, but it also supports the queuing model, where processing order is not important and we just want to load balance messages in a topic across an arbitrary number of consumers:

This (and queuing-oriented features like “dead letter queue” and negative acknowledgment with redelivery) means that Pulsar can often replace AMQP and JMS use cases as well as Kafka-style pub/sub, offering a further opportunity for cost reduction to enterprises adopting Pulsar.
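The queuing model can be sketched as a toy simulation. The `SharedSubscription` class and its method names are made up for this example, and the round-robin dispatch is an assumption about how messages are spread over consumers. It shows messages being load balanced, acknowledged individually and out of order, and redelivered after a negative acknowledgement.

```python
from collections import deque


class SharedSubscription:
    """Toy model of one subscription shared by several consumers."""

    def __init__(self, consumers):
        self.consumers = list(consumers)
        self.backlog = deque()  # (message id, payload) waiting for dispatch
        self.pending = {}       # message id -> (consumer, payload), not yet acked
        self._cursor = 0        # round-robin position
        self._next_id = 0

    def publish(self, payload):
        self.backlog.append((self._next_id, payload))
        self._next_id += 1

    def dispatch(self):
        """Hand each backlog message to the next consumer in turn."""
        deliveries = []
        while self.backlog:
            msg_id, payload = self.backlog.popleft()
            consumer = self.consumers[self._cursor % len(self.consumers)]
            self._cursor += 1
            self.pending[msg_id] = (consumer, payload)
            deliveries.append((consumer, msg_id, payload))
        return deliveries

    def ack(self, msg_id):
        """Individual acknowledgement: only this message is marked done."""
        del self.pending[msg_id]

    def nack(self, msg_id):
        """Negative acknowledgement: put the message back for redelivery."""
        _, payload = self.pending.pop(msg_id)
        self.backlog.append((msg_id, payload))
```

A consumer can acknowledge message 2 before message 0 is finished, and a message that is negatively acknowledged goes back into the backlog and may be delivered to a different consumer on the next dispatch.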

Migrating from Kafka to Pulsar

There are two methods

1. Pulsar Adaptor for Apache Kafka

The explanation is as follows

Initially, the Pulsar community tried to solve the migration issue by developing a tool called Pulsar Adaptor for Apache Kafka. It allows users to replace the Kafka client dependency with the Pulsar Kafka wrapper. It does not require any changes to the existing code. Nevertheless, its disadvantages are obvious:

- Only applicable to Java-based clients
- Problems in handling Kafka offsets
- Users still need to learn some Pulsar client configurations

2. Kafka-on-Pulsar

The explanation is as follows

To provide a smoother migration experience for users, the KoP community came up with a new solution. They decided to bring the native Kafka protocol support to Pulsar by introducing a Kafka protocol handler on Pulsar brokers. Protocol handlers were a new feature introduced in Pulsar 2.5.0. They allow Pulsar brokers to support other messaging protocols, including Kafka, AMQP, and MQTT.

Compared with the above-mentioned migration plans, KoP features the following key benefits:
- No Code Change: Users do not need to modify any code in their Kafka applications, including clients written in different languages, the applications themselves, and third-party components.
- Great Compatibility: KoP is compatible with the majority of tools in the Kafka ecosystem. It currently supports Kafka 0.9+.
- Direct Interaction With Pulsar Brokers: Before KoP was designed, some users tried to make the Pulsar client serve the request sent by the Kafka client by creating a proxy layer in the middle. This might impact performance as it entailed additional routing requests. By comparison, KoP allows clients to directly communicate with Pulsar brokers without compromising performance.

Docker and Pulsar

We do it like this

docker run -it -p 6650:6650 -p 8080:8080 \
  --mount source=pulsardata,target=/pulsar/data \
  --mount source=pulsarconf,target=/pulsar/conf \
  apachepulsar/pulsar:2.11.0 \
  bin/pulsar standalone

Docker Compose and Pulsar

We do it like this

version: '3.5'

services:
  # Single-node Pulsar in standalone mode, intended for local development
  pulsar:
    image: "apachepulsar/pulsar:2.10.1"
    command: bin/pulsar standalone
    environment:
      # JVM heap and direct memory limits for the standalone process
      PULSAR_MEM: "-Xms512m -Xmx512m -XX:MaxDirectMemorySize=1g"
    volumes:
      - ./pulsar/data:/pulsar/data
    ports:
      - "6650:6650"  # Pulsar binary protocol, used by clients
      - "8080:8080"  # HTTP admin API
    restart: unless-stopped
    networks:
      - pulsar_network

  # Web UI for managing the Pulsar instance
  pulsar-manager:
    image: "apachepulsar/pulsar-manager:v0.2.0"
    ports:
      - "9527:9527"  # Pulsar Manager frontend
      - "7750:7750"  # Pulsar Manager backend
    depends_on:
      - pulsar
    environment:
      SPRING_CONFIGURATION_FILE: /pulsar-manager/pulsar-manager/application.properties
    networks:
      - pulsar_network

networks:
  pulsar_network:
    name: pulsar_network
    driver: bridge

To access it, clients use the binary protocol address, and the admin API is served over HTTP

pulsar://localhost:6650
or
http://localhost:8080

The pulsar command

Example

We do it like this

$ bin/pulsar tokens create \
    --secret-key file:///path/to/token-generation-secret.key \
    --expiry-time 1y \
    --subject my-test-role

The explanation is as follows

Using a JWT authentication example, the following Pulsar CLI command creates a JWT token that expires in one year:

The generated token (displayed as the CLI output) is associated with a Role Token named my-test-role. Any client that has the generated JWT token can successfully connect to Pulsar.
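To show what such a token carries, here is a standard-library sketch that builds and reads an HS256-signed JWT. It is not how the Pulsar CLI is implemented: the helper names are made up for this example, the secret is a placeholder, and the choice of HS256 and of the `sub` claim for the role name is an assumption about the token layout.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url without padding, as used in JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _sign(signing_input: str, secret: bytes) -> str:
    digest = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return _b64url(digest)


def create_token(secret: bytes, subject: str, expires_in_seconds: int, now=None) -> str:
    """Build header.payload.signature with the role in "sub" and the expiry in "exp"."""
    now = int(time.time()) if now is None else now
    header = json.dumps({"alg": "HS256"}, separators=(",", ":")).encode()
    payload = json.dumps({"sub": subject, "exp": now + expires_in_seconds},
                         separators=(",", ":")).encode()
    signing_input = _b64url(header) + "." + _b64url(payload)
    return signing_input + "." + _sign(signing_input, secret)


def read_claims(token: str, secret: bytes) -> dict:
    """Verify the signature and return the payload claims."""
    signing_input, _, signature = token.rpartition(".")
    if not hmac.compare_digest(signature, _sign(signing_input, secret)):
        raise ValueError("bad signature")
    payload_b64 = signing_input.split(".")[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

A token created with the subject `my-test-role` and a lifetime of 365 days decodes back to those two claims, and reading it with a different secret fails, which is why anyone holding a valid token can connect as that role.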

The pulsar-admin command

Examples are here


Thursday, June 1, 2023

Apache Pulsar - Competitor
