Introduction
What exactly does tiered storage mean in the context of a streaming system? The basic idea is to keep only recent data on local disk and asynchronously move historical data to object storage, where it can rest cheaply.
Why It Is Needed
The explanation is as follows. As the volume of data grows, keeping everything on local disk stops making sense, which is why tiered storage is adopted.
In a traditional Kafka deployment, data is stored on the local disks of Kafka brokers. However, as the volume of data grows, storing all the data on local disks can become expensive and less efficient. Tiered storage provides a solution to this problem by introducing the ability to store data in multiple tiers, typically using a combination of fast and expensive storage and slower and cheaper storage options.
Data is classified as
1. Hot Data: frequently accessed data.
2. Cold Data: less frequently accessed data.
The explanation is as follows.
With tiered storage, you can define different storage tiers based on your requirements. For example, you can have a hot tier consisting of high-performance and expensive storage like solid-state drives (SSDs) or in-memory storage for storing frequently accessed or recent data. The hot tier ensures that the most critical data is readily available for fast processing.
On the other hand, you can have a cold tier composed of lower-cost and higher-capacity storage options such as traditional hard disk drives (HDDs) or cloud object storage. The cold tier is used for storing less frequently accessed or older data that may not require immediate processing.
The explanation is as follows. Hot Data is kept on the faster storage system.
Tiered storage in Kafka allows you to store older data on cheaper, cost-effective storage systems while keeping more recent data on faster storage systems. This feature is available in Confluent Platform (a distribution of Apache Kafka), starting from version 5.4.
2. However
There is no consensus that Tiered Storage really lowers cost, so Tiered Storage ends up not being used.
1. Increased Complexity and Operational Burden
In other words, Tiered Storage makes everything more complex. Performance guarantees can no longer be given, because Tiered Storage solutions read data in blocks. That means hundreds of megabytes of data have to be downloaded first.
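To get a feel for why block-wise reads hurt latency, here is a rough back-of-the-envelope sketch. The block size and bandwidth figures are assumptions picked for illustration, not measurements from any particular tiered storage implementation.

```python
MIB = 1024 * 1024


def download_seconds(block_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time to fetch one whole block before the first record in it can be served."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return block_bytes / bandwidth_bytes_per_s


# Assumed figures: a 256 MiB block fetched from object storage at 100 MiB/s.
delay = download_seconds(256 * MIB, 100 * MIB)
print(f"delay before the first record: {delay:.2f} s")
```

A consumer reading from local disk does not pay this up-front cost, which is why the latency of cold reads is hard to guarantee.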
2. No Reduction in inter-zone Networking
Cloud disks are thought to be expensive, but the real cost comes from cloud networking. In a standard High Availability setup, Kafka needs 3 different availability zones to run. Networking accounts for 80% of the cost of operating Kafka. So Tiered Storage does not actually address the most expensive part.
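The networking argument can be made concrete with a small estimate. This is only a sketch under stated assumptions: replicas are placed in distinct zones, producers are spread evenly across zones, and consumer traffic is ignored. The function name and the sample volume are hypothetical.

```python
def cross_zone_gb(produced_gb: float, replication_factor: int = 3, zones: int = 3) -> float:
    """Estimate how many GB cross zone boundaries for a given produced volume.

    Assumptions (not from the source): each replica lives in a different zone,
    and producers are evenly spread, so (zones - 1) / zones of the produce
    traffic reaches a leader in another zone. Consumer traffic is ignored.
    """
    if replication_factor > zones:
        raise ValueError("this sketch assumes at most one replica per zone")
    replication = produced_gb * (replication_factor - 1)  # leader -> followers
    produce = produced_gb * (zones - 1) / zones           # producer -> leader
    return replication + produce


# 300 GB produced with replication factor 3 across 3 zones.
print(cross_zone_gb(300))
```

None of the terms in this estimate depends on where old segments are stored, which matches the point that Tiered Storage leaves the inter-zone traffic untouched.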
3. Settings
3.1 Hot Tier
1. Specify Hot Tier Directory
log.dirs Field
In the Kafka server properties, you need to define the directory path where the Hot Tier data will be stored. This is done using the log.dirs property. The specified path should point to a location on a storage device that offers high-speed access, such as SSDs or NVMe devices.
Example
# Server Properties for Hot Tier
log.dirs=/path/to/hot/tier
2. Topic Configuration
Configure Kafka topics to use the Hot Tier for storing their data. By default, Kafka automatically creates topics if they do not exist. However, in a tiered storage scenario, you might want to disable automatic topic creation to have more control over how topics are configured.
Example
# Disable Automatic Topic Creation
auto.create.topics.enable=false
# Topic-Specific Configuration for the Hot Tier
# Here, "my_hot_topic" is the name of the topic you want to keep in the Hot Tier.
# Note: log.dirs is a broker-wide property; Kafka has no per-topic log directory setting,
# so the line below is illustrative only and is not a valid broker property.
# topic.config.my_hot_topic=log.dirs=/path/to/hot/tier
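With automatic topic creation disabled, each topic has to be created explicitly. The sketch below assembles a kafka-topics.sh invocation for that purpose. The partition count, bootstrap address, and retention value are assumptions for illustration, and create_topic_command is a hypothetical helper name.

```python
import shlex


def create_topic_command(topic, partitions, replication_factor, bootstrap_server, configs=None):
    """Build the argument list for `kafka-topics.sh --create`."""
    args = [
        "kafka-topics.sh", "--create",
        "--bootstrap-server", bootstrap_server,
        "--topic", topic,
        "--partitions", str(partitions),
        "--replication-factor", str(replication_factor),
    ]
    # Topic-level overrides are passed as repeated --config key=value pairs.
    for key, value in sorted((configs or {}).items()):
        args += ["--config", f"{key}={value}"]
    return args


cmd = create_topic_command(
    "my_hot_topic", 6, 3, "localhost:9092",
    configs={"retention.ms": 604800000},  # assumed value: 7 days
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run on a machine where the Kafka command-line tools are on the PATH.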
3. Additional Tuning
Depending on your specific use case and requirements, you may need to tune other Kafka configurations for the Hot Tier. For example, you might adjust parameters related to retention policies, replication factor, and log segment sizes to optimize performance.
Example
# Adjusting Retention Policy (example: retain data for 7 days)
log.retention.hours=168
# Adjusting Replication Factor (example: set replication factor to 3 for fault tolerance)
default.replication.factor=3
# Log Segment Size (example: set log segment size to 1 GB)
log.segment.bytes=1073741824
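The numbers in these settings are easier to sanity-check with a little arithmetic. The sketch below converts the retention period to milliseconds and estimates how many segments a partition holds at steady state. The 1 MiB/s write rate is an assumption, not a value from the configuration above.

```python
GIB = 1024 ** 3  # 1073741824, the log.segment.bytes value used above


def retention_ms(hours: int) -> int:
    """Convert a retention period in hours to milliseconds."""
    return hours * 60 * 60 * 1000


def segments_on_disk(bytes_per_second: float, retention_hours: int, segment_bytes: int) -> float:
    """Approximate number of segments one partition keeps at steady state."""
    total_bytes = bytes_per_second * retention_hours * 3600
    return total_bytes / segment_bytes


print(retention_ms(168))                         # 7 days expressed in ms
print(segments_on_disk(1024 ** 2, 168, GIB))     # assumed 1 MiB/s per partition
```

Larger segments mean fewer files, but a segment only becomes eligible for deletion after it has been closed, so very large segments can delay cleanup on low-traffic partitions.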
4. Monitoring and Maintenance
Implement monitoring mechanisms to keep track of the Hot Tier’s performance, disk usage, and other relevant metrics. Regularly review these metrics to ensure that the Hot Tier is effectively handling the high-frequency data and making adjustments as needed.
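As one small piece of such monitoring, disk usage of the Hot Tier directories can be checked with the standard library. This is a minimal sketch: the 80% warning threshold and the check_hot_tier helper name are assumptions, and a production setup would normally rely on broker metrics as well.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_hot_tier(log_dirs: str, warn_at_percent: float = 80.0):
    """Return (directory, percent_used, over_threshold) for each log.dirs entry.

    `log_dirs` uses the same comma-separated form as the broker property.
    """
    results = []
    for directory in (d.strip() for d in log_dirs.split(",") if d.strip()):
        pct = disk_usage_percent(directory)
        results.append((directory, pct, pct >= warn_at_percent))
    return results


# "." stands in for a real Hot Tier path such as /path/to/hot/tier.
for directory, pct, warn in check_hot_tier("."):
    print(f"{directory}: {pct:.1f}% used{' (WARNING)' if warn else ''}")
```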
5. Data Movement Policies
Consider implementing data movement policies that define when and how data transitions from the Hot Tier to the Cold Tier. These policies could be based on criteria such as the age of the data or its access frequency. This ensures that only the most relevant and frequently accessed data remains in the Hot Tier.
Example
# Illustrative policy format only; this is not a built-in Kafka configuration
dataMovementPolicy:
  criteria: age
  threshold: 7d
In this example, data older than 7 days is configured to be automatically moved from the Hot Tier to the Cold Tier. Configuring the Hot Tier is a crucial aspect of optimizing Kafka Tiered Storage for performance.
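An age-based policy like this can be sketched in a few lines. The Segment type, the helper names, and the supported unit suffixes are all hypothetical; the code only illustrates the selection logic, not how any Kafka distribution implements tiering.

```python
from dataclasses import dataclass

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_threshold(text: str) -> int:
    """Convert a value such as '7d' into seconds."""
    if len(text) < 2 or text[-1] not in _UNIT_SECONDS or not text[:-1].isdigit():
        raise ValueError(f"unsupported threshold: {text!r}")
    return int(text[:-1]) * _UNIT_SECONDS[text[-1]]


@dataclass
class Segment:
    name: str
    last_modified: float  # epoch seconds


def segments_to_move(segments, threshold: str, now: float):
    """Select the segments whose age exceeds the threshold."""
    limit = parse_threshold(threshold)
    return [s for s in segments if now - s.last_modified > limit]


# Hypothetical segments: one 8 days old, one 1 day old, relative to `now`.
now = 1_000_000_000.0
segments = [
    Segment("00000000000000000000.log", now - 8 * 86400),
    Segment("00000000000000123456.log", now - 1 * 86400),
]
print([s.name for s in segments_to_move(segments, "7d", now)])
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the selection logic easy to test.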
3.2 Cold Tier
S3
Example
confluent.tier.feature=true
confluent.tier.backend=S3
confluent.tier.s3.region=<your-aws-s3-region>
confluent.tier.s3.bucket=<your-aws-s3-bucket-name>
confluent.tier.s3.credentials.provider=static
confluent.tier.s3.access.key=<your-aws-access-key>
confluent.tier.s3.secret.key=<your-aws-secret-key>
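Because the snippet above is full of angle-bracket placeholders, a quick check before starting the broker can catch values that were never filled in. This is a minimal sketch: parse_properties handles only simple key=value lines, not the full Java properties syntax, and both helper names are hypothetical.

```python
import re


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def unfilled_placeholders(props: dict) -> list:
    """Return the keys whose value is still a <placeholder>."""
    return sorted(k for k, v in props.items() if re.fullmatch(r"<[^<>]+>", v))


sample = """
confluent.tier.feature=true
confluent.tier.backend=S3
confluent.tier.s3.region=<your-aws-s3-region>
confluent.tier.s3.bucket=my-bucket
"""
print(unfilled_placeholders(parse_properties(sample)))
```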
Azure Storage
Example
In the broker settings file we do the following
# Configure Tiered Storage Plugin
plugin.path=/path/to/tiered-storage-plugin
# Configure Azure Storage as a Tier
tier.azure.class=io.confluent.tieredstorage.azure.AzureBlobStorageProvider
tier.azure.name=azure-tier
tier.azure.azure.blob.account.name=your-storage-account-name
tier.azure.azure.blob.account.key=your-storage-account-key
tier.azure.azure.blob.container.name=your-container-name
# Configure Azure Storage Tier Properties
tier.azure.azure.blob.max.connections=10
tier.azure.azure.blob.block.size=67108864
tier.azure.azure.blob.buffer.size=67108864
tier.azure.azure.blob.timeout.ms=30000
In the above code snippet:
1. Set plugin.path to the directory where the Tiered Storage Plugin JAR file is located.
2. Configure Azure Storage as a tier by setting the following properties:
- tier.azure.class: Specifies the class responsible for Azure Storage integration.
- tier.azure.name: Assigns a name to the Azure Storage tier (e.g., azure-tier).
- tier.azure.azure.blob.account.name: Specifies the name of your Azure Storage account.
- tier.azure.azure.blob.account.key: Specifies the access key for your Azure Storage account.
- tier.azure.azure.blob.container.name: Specifies the name of the Azure Storage container where Kafka data will be stored.