Kafka Çorbası: Tiered Storage

Introduction

The explanation is as follows:

... what exactly does tiered storage mean in the context of a streaming system? The basic idea is to only persist recent data to disk, and asynchronously move historical data to object storage where it can rest cheaply.

Why It Is Needed

The explanation is as follows. As the volume of data grows, keeping everything on local disk stops making sense, so deployments move to tiered storage.

In a traditional Kafka deployment, data is stored on the local disks of Kafka brokers. However, as the volume of data grows, storing all the data on local disks can become expensive and less efficient. Tiered storage provides a solution to this problem by introducing the ability to store data in multiple tiers, typically using a combination of fast and expensive storage and slower and cheaper storage options.

Data

Data is classified as:

1. Hot Data: frequently accessed data.

2. Cold Data: less frequently accessed data.

The explanation is as follows.

With tiered storage, you can define different storage tiers based on your requirements. For example, you can have a hot tier consisting of high-performance and expensive storage like solid-state drives (SSDs) or in-memory storage for storing frequently accessed or recent data. The hot tier ensures that the most critical data is readily available for fast processing.

On the other hand, you can have a cold tier composed of lower-cost and higher-capacity storage options such as traditional hard disk drives (HDDs) or cloud object storage. The cold tier is used for storing less frequently accessed or older data that may not require immediate processing.

The explanation is as follows. Hot data is kept on the faster storage system.

Tiered storage in Kafka allows you to store older data on cheaper, cost-effective storage systems while keeping more recent data on faster storage systems. This feature is available in Confluent Platform (a distribution of Apache Kafka), starting from version 5.4.

2. However

There is no consensus that tiered storage actually lowers cost, so in practice it is not widely adopted.

The reasons are as follows:

1. Increased Complexity and Operational Burden

In other words, tiered storage makes everything more complex, and performance guarantees can no longer be given.

The reason is that tiered storage solutions read data in blocks, so hundreds of megabytes may have to be downloaded before a read can be served.
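To get a feel for why block-oriented reads hurt latency, here is a minimal back-of-the-envelope sketch. The block size and throughput figures are assumptions chosen purely for illustration, not measurements of any particular tiered storage implementation.

```python
MIB = 1024 * 1024


def fetch_seconds(block_bytes: int, throughput_bytes_per_s: float) -> float:
    """Time needed to download one whole block before any record in it can be served."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return block_bytes / throughput_bytes_per_s


# Assumed values: a 256 MiB block and 100 MiB/s of object-store throughput.
block = 256 * MIB
throughput = 100 * MIB

print(f"{fetch_seconds(block, throughput):.2f} s before the first record is usable")
```

Under these assumed numbers a consumer that needs a single old record still waits for the whole block, which is the kind of latency a local disk read does not incur.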

2. No Reduction in inter-zone Networking

Cloud disks are commonly thought to be the expensive part, but the real cost comes from cloud networking. A standard high-availability setup runs Kafka across 3 different availability zones, and network cost makes up about 80% of the cost of operating Kafka. In other words, tiered storage does not address the most expensive part.
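The networking argument can be made concrete with a small sketch. It assumes one replica per zone, producers and consumers spread evenly across zones, and no follower fetching; the function name and parameters are hypothetical and only serve the illustration. No prices are included, only byte counts.

```python
def cross_zone_bytes(produced_bytes: float, zones: int = 3,
                     replication_factor: int = 3,
                     consumer_groups: int = 1) -> float:
    """Estimate bytes that cross availability-zone boundaries.

    Assumptions (not taken from the source): each replica lives in a
    different zone, clients are spread evenly over the zones, and
    consumers always read from the partition leader.
    """
    remote_fraction = (zones - 1) / zones
    produce = produced_bytes * remote_fraction            # producer -> leader
    replicate = produced_bytes * (replication_factor - 1)  # leader -> followers
    consume = produced_bytes * remote_fraction * consumer_groups
    return produce + replicate + consume


# For every byte produced, roughly this many bytes cross a zone boundary.
print(f"{cross_zone_bytes(1.0):.2f}")
```

None of the three terms depends on where old segments are stored, which is why moving historical data to object storage leaves this traffic unchanged.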

3. Configuration

3.1 Hot Tier

1. Specify Hot Tier Directory

The log.dirs Property

The explanation is as follows

In the Kafka server properties, you need to define the directory path where the Hot Tier data will be stored. This is done using the log.dirs property. The specified path should point to a location on a storage device that offers high-speed access, such as SSDs or NVMe devices.

Example

This is done as follows

# Server Properties for Hot Tier
log.dirs=/path/to/hot/tier

2. Topic Configuration

The explanation is as follows

Configure Kafka topics to use the Hot Tier for storing their data. By default, Kafka automatically creates topics if they do not exist. However, in a tiered storage scenario, you might want to disable automatic topic creation to have more control over how topics are configured.

Example

This is done as follows

# Disable Automatic Topic Creation
auto.create.topics.enable=false

# Topic placement on the Hot Tier
# Note: log.dirs is a broker-level property; Kafka has no per-topic log.dirs override.
# Every topic hosted on this broker keeps its local segments under the broker's log.dirs,
# so the line below is not a valid Kafka property and is left commented out.
# topic.config.my_hot_topic=log.dirs=/path/to/hot/tier

3. Additional Tuning

The explanation is as follows

Depending on your specific use case and requirements, you may need to tune other Kafka configurations for the Hot Tier. For example, you might adjust parameters related to retention policies, replication factor, and log segment sizes to optimize performance.

Example

This is done as follows

# Adjusting Retention Policy (example: retain data for 7 days)
log.retention.hours=168

# Adjusting Replication Factor (example: set replication factor to 3 for fault tolerance)
default.replication.factor=3

# Log Segment Size (example: set log segment size to 1 GB)
log.segment.bytes=1073741824
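The numeric values above are easy to mistype, so here is a small sketch that derives them from human-readable quantities. It only reproduces the arithmetic behind the two values shown and does not talk to Kafka.

```python
from datetime import timedelta

# 7 days of retention expressed in hours, as log.retention.hours expects.
retention = timedelta(days=7)
retention_hours = int(retention.total_seconds() // 3600)

# 1 GiB expressed in bytes, as log.segment.bytes expects.
segment_bytes = 1 * 1024 ** 3

print(f"log.retention.hours={retention_hours}")
print(f"log.segment.bytes={segment_bytes}")
```

Running it prints the same two property lines as in the example, `log.retention.hours=168` and `log.segment.bytes=1073741824`.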

4. Monitoring and Maintenance

The explanation is as follows

Implement monitoring mechanisms to keep track of the Hot Tier’s performance, disk usage, and other relevant metrics. Regularly review these metrics to ensure that the Hot Tier is effectively handling the high-frequency data and making adjustments as needed.
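As one possible starting point for the disk-usage part of such monitoring, the sketch below checks how full each directory in a comma-separated log.dirs value is. The 80% warning threshold and the function names are assumptions for illustration; a production setup would more likely export such numbers to an existing metrics system.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is currently used."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_log_dirs(log_dirs: str, warn_at_percent: float = 80.0):
    """Return (path, used_percent, should_warn) for each entry in a log.dirs value."""
    results = []
    for path in (p.strip() for p in log_dirs.split(",")):
        if not path:
            continue
        pct = disk_usage_percent(path)
        results.append((path, pct, pct >= warn_at_percent))
    return results


# "." stands in for the real hot-tier path so the sketch runs anywhere.
for path, pct, warn in check_log_dirs("."):
    status = "WARN" if warn else "ok"
    print(f"{path}: {pct:.1f}% used [{status}]")
```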

5. Data Movement Policies

The explanation is as follows

Consider implementing data movement policies that define when and how data transitions from the Hot Tier to the Cold Tier. These policies could be based on criteria such as the age of the data or its access frequency. This ensures that only the most relevant and frequently accessed data remains in the Hot Tier.

Example

As an illustrative policy sketch (not a configuration format that Kafka itself reads), this is done as follows

dataMovementPolicy:
  criteria: age
  threshold: 7d

The explanation is as follows

In this example, data older than 7 days is configured to be automatically moved from the Hot Tier to the Cold Tier. Configuring the Hot Tier is a crucial aspect of optimizing Kafka Tiered Storage for performance.
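The age-based rule can be sketched in a few lines. This is a minimal illustration of the selection logic only; the Segment type, its field names, and the sample file names are hypothetical. In Apache Kafka's own tiered storage the comparable knob is, as far as I know, the topic-level local.retention.ms setting rather than a policy file.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Segment:
    name: str                # hypothetical segment file name
    last_modified: datetime  # time of the last write to the segment


def segments_to_move(segments, threshold: timedelta, now: datetime):
    """Return the segments strictly older than `threshold`, i.e. Cold Tier candidates."""
    cutoff = now - threshold
    return [s for s in segments if s.last_modified < cutoff]


now = datetime(2023, 5, 18, tzinfo=timezone.utc)
segments = [
    Segment("00000000000000000000.log", now - timedelta(days=10)),
    Segment("00000000000000012345.log", now - timedelta(days=7)),
    Segment("00000000000000067890.log", now - timedelta(days=1)),
]

for s in segments_to_move(segments, timedelta(days=7), now):
    print(s.name)
```

Only the 10-day-old segment is selected; the one that is exactly 7 days old stays in the Hot Tier because the rule is "older than", not "at least".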

3.2 Cold Tier

Example

This is done as follows

confluent.tier.feature=true
confluent.tier.backend=S3
confluent.tier.s3.region=<your-aws-s3-region>
confluent.tier.s3.bucket=<your-aws-s3-bucket-name>
confluent.tier.s3.credentials.provider=static
confluent.tier.s3.access.key=<your-aws-access-key>
confluent.tier.s3.secret.key=<your-aws-secret-key>

Azure Storage

Example

In the broker configuration file, this is done as follows

# Configure Tiered Storage Plugin
plugin.path=/path/to/tiered-storage-plugin

# Configure Azure Storage as a Tier
tier.azure.class=io.confluent.tieredstorage.azure.AzureBlobStorageProvider
tier.azure.name=azure-tier
tier.azure.azure.blob.account.name=your-storage-account-name
tier.azure.azure.blob.account.key=your-storage-account-key
tier.azure.azure.blob.container.name=your-container-name

# Configure Azure Storage Tier Properties
tier.azure.azure.blob.max.connections=10
tier.azure.azure.blob.block.size=67108864
tier.azure.azure.blob.buffer.size=67108864
tier.azure.azure.blob.timeout.ms=30000

The explanation is as follows

In the above code snippet:
1. Set plugin.path to the directory where the Tiered Storage Plugin JAR file is located.
2. Configure Azure Storage as a tier by setting the following properties:
- tier.azure.class: Specifies the class responsible for Azure Storage integration.
- tier.azure.name: Assigns a name to the Azure Storage tier (e.g., azure-tier).
- tier.azure.azure.blob.account.name: Specifies the name of your Azure Storage account.
- tier.azure.azure.blob.account.key: Specifies the access key for your Azure Storage account.
- tier.azure.azure.blob.container.name: Specifies the name of the Azure Storage container where Kafka data will be stored.

Thursday, May 18, 2023
