
Commit 2f35861

Add in the solution of bootstrapping and using the latest timestamp.
1 parent bcf847d commit 2f35861

4 files changed, +298 -10 lines changed

kotlin-kafka-streams-examples/README.md

Lines changed: 242 additions & 3 deletions
@@ -100,7 +100,7 @@ docker run --rm --net=host confluentinc/cp-kafka:latest kafka-topics --create -
```shell
-docker run --rm --net=host confluentinc/cp-kafka:latest kafka-console-consumer --bootstrap-server localhost:9092 --topic name
+docker run --rm --net=host confluentinc/cp-kafka:latest kafka-console-consumer --bootstrap-server localhost:9093 --topic name --property print.key=true --from-beginning
docker run --rm --net=host confluentinc/cp-kafka:latest kafka-console-consumer --bootstrap-server localhost:9093 --topic name-formatted --property print.key=true --from-beginning
```

@@ -110,6 +110,13 @@ docker exec -it kafka-3 kafka-console-producer --broker-list kafka-2:29092 --to

### Test semantics

These tests use the standard properties for caching and buffering. Later we will run the same tests with them turned off, because the cache compacts the data in memory and can give different results; see the reference on [memory management](https://docs.confluent.io/platform/current/streams/developer-guide/memory-mgmt.html).

I expect that these initial tests, with buffering and the cache enabled, will compact the data for us and only show the last value per key.
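
For reference, this in-memory compaction is governed by the record cache size and the commit interval. A rough sketch with values close to the library defaults (the exact overrides used for these tests appear further down and in `AppConfig`):

```kotlin
// Record cache: a larger cache lets Streams compact repeated keys in memory
// before forwarding records downstream or flushing them to the changelog.
streamsConfiguration[StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG] = 10 * 1024 * 1024L
// Commit interval: how often the cache is flushed even when it is not full.
streamsConfiguration[StreamsConfig.COMMIT_INTERVAL_MS_CONFIG] = 30_000L
```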

For the first test we will run just a KTable that consumes the messages off a compacted topic after two messages with the same key have been placed on the topic. I would expect that this topology will process all messages on start up, including duplicate keys, so we see the full history following streaming semantics.

@@ -134,8 +141,8 @@ Topic: name PartitionCount: 3 ReplicationFactor: 3 Configs: cleanup.policy=compa

Put two messages on the `name` topic with the same key when the application is stopped.

```shell
-tom perks
-tom matthews
+tom:perks
+tom:matthews
```

@@ -240,6 +247,7 @@ If we were to rekey and join with a different key how are the semantics well let

Put these messages onto the compact topic `name` whilst the application is down.

```shell
sarah:mark1
mark:sarah1
```

@@ -419,6 +427,237 @@ steven holly

As expected from the documentation, the GlobalKTable will load up all the data first before starting the application. If this is the case then we will always join against the table's latest value.
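
The GlobalKTable example itself sits further up the README and is not part of this diff. A minimal sketch of that style of topology, assuming the same serdes and an illustrative `input-names` stream topic:

```kotlin
// Sketch only: a GlobalKTable is fully bootstrapped before processing starts,
// so the join below always sees the table's latest value for a key.
val nameGlobalTable: GlobalKTable<String, String> =
    streamsBuilder.globalTable("name", Consumed.with(Serdes.String(), Serdes.String()))

streamsBuilder.stream("input-names", Consumed.with(Serdes.String(), Serdes.String()))
    .join(
        nameGlobalTable,
        KeyValueMapper { key: String, _: String -> key },           // map each stream record to the table key
        ValueJoiner { _: String, tableValue: String -> tableValue } // always take the table's latest value
    )
    .to("name-formatted", Produced.with(Serdes.String(), Serdes.String()))
```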

#### Turning off cache tests

Back to the simple self join example, but with the cache turned off.

```kotlin
streamsConfiguration[StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG] = 0
```

KTable self join:

```kotlin
val nameKTable = streamsBuilder
    .table("name", Consumed.with(Serdes.String(), Serdes.String()))

nameKTable
    .toStream()
    .peek { key, value ->
        logger.info("Processing {}, {}", key, value)
    }
    .leftJoin(nameKTable, ValueJoiner { value1, value2 ->
        logger.info("Joining the Stream Name {} to the KTable Name {}", value1, value2)
        value2
    }, Joined.with(Serdes.String(), Serdes.String(), Serdes.String()))
    .to("name-formatted", Produced.with(Serdes.String(), Serdes.String()))
```

We send these values onto the `name` topic before starting the app; remember this topic is compacted.

```shell
tom:perks
tom:matthews
tom:stevens
sharon:news
sharon:car
tom:party
```

As expected, we now process all the values: the buffering and cache layer does not merge the records.

```shell
Processing tom, perks
Joining the Stream Name perks to the KTable Name perks
Processing tom, matthews
Joining the Stream Name matthews to the KTable Name matthews
Processing tom, stevens
Joining the Stream Name stevens to the KTable Name stevens
Processing sharon, news
Joining the Stream Name news to the KTable Name news
Processing sharon, car
Joining the Stream Name car to the KTable Name car
Processing tom, party
Joining the Stream Name party to the KTable Name party
```

All the values are output to the topic:

```shell
perks
matthews
stevens
news
car
party
```

Let's run the same example and turn the cache back on.
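
Turning the cache back on simply means dropping the zero override, or giving it a non-zero size; for example, something close to the default:

```kotlin
// Restore a non-zero record cache so repeated keys can be compacted in memory again.
streamsConfiguration[StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG] = 10 * 1024 * 1024L
```

With the cache re-enabled, the same data is put on the topic again.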

```shell
tom:perks
tom:matthews
tom:stevens
sharon:news
sharon:car
tom:party
```

This results in the data being merged, which is what we expected. There is no guarantee of compaction though: it depends on `StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG`, and `COMMIT_INTERVAL_MS_CONFIG` should also be considered.

```shell
Processing sharon, car
Joining the Stream Name car to the KTable Name car
Processing tom, party
Joining the Stream Name party to the KTable Name party
```

Now, as per this [JIRA](https://issues.apache.org/jira/browse/KAFKA-4113), you can set the timestamps of the messages to 0 and this will make the KTable behave like a GlobalKTable.

Let's follow this advice and use a custom timestamp extractor, putting the same kind of data onto the topic. This time we expect that, even with no cache, the data will only join with the latest timestamp record.

The data will still stream in order but the join will only ever be with the latest.

Place the data on the topic, using new data this time:

```shell
clark:perks
clark:matthews
clark:stevens
sarah:news
sarah:car
clark:party
```

Interestingly, with the cache disabled and this custom timestamp extractor returning zero, we still process all events and each one joins against the record with the same timestamp rather than the latest value.

```shell
Processing sarah, news
Joining the Stream Name news to the KTable Name news
Processing sarah, car
Joining the Stream Name car to the KTable Name car
Processing clark, perks
Joining the Stream Name perks to the KTable Name perks
Processing clark, matthews
Joining the Stream Name matthews to the KTable Name matthews
Processing clark, stevens
Joining the Stream Name stevens to the KTable Name stevens
Processing clark, party
Joining the Stream Name party to the KTable Name party
```

If you read further up the JIRA you can see why:

> What you could do it, to write a custom timestamp extractor, and return `0` for each table side record and wall-clock time for each stream side record. In `extract()` to get a `ConsumerRecord` and can inspect the topic name to distinguish between both. Because `0` is smaller than wall-clock time, you can "bootstrap" the table to the end of the topic before any stream-side record gets processed.
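
A sketch of the extractor described in that comment might look like the following, where records from the table-side topic are forced to timestamp `0` and everything else keeps its own record timestamp (the topic name check is an assumption for illustration):

```kotlin
class BootstrapAwareTimestampExtractor : TimestampExtractor {
    override fun extract(record: ConsumerRecord<Any, Any>?, partitionTime: Long): Long {
        // Table-side ("bootstrap") records get 0 so the table reads to the end of its topic
        // before any stream-side record, which keeps its real timestamp, is processed.
        return if (record?.topic() == "name") 0L else record?.timestamp() ?: partitionTime
    }
}
```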

We need to set zero only for the bootstrap, but here we are doing a self join.

Therefore we can implement a custom transformer that sets the timestamp back to the correct one on the stream flow, whilst setting it to zero on the KTable consume.

Here is the custom timestamp extractor where all records are given timestamp zero.

```kotlin
class IgnoreTimestampExtractor : TimestampExtractor {
    override fun extract(record: ConsumerRecord<Any, Any>?, partitionTime: Long): Long {
        // Ignore the record timestamp so the KTable bootstraps fully before any joins.
        return 0
    }
}
```

Now we set this on the `Consumed` for the KTable.

```kotlin
val nameKTable = streamsBuilder
    .table(
        "name",
        Consumed.with(Serdes.String(), Serdes.String()).withTimestampExtractor(IgnoreTimestampExtractor())
    )
```

Now we set up the custom transformer to change the stream-side timestamp.

```kotlin
override fun transform(key: String?, value: String?): KeyValue<String, String>? {
    // In reality use the timestamp on the event
    context.forward(
        key,
        value,
        To.all().withTimestamp(LocalDateTime.now().toInstant(ZoneOffset.UTC).toEpochMilli())
    )
    return null
}
```
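
For completeness, the transformer sits in the stream between two `peek` calls; this mirrors the topology change further down in this commit:

```kotlin
nameKTable
    .toStream()
    .peek { key, value -> logger.info("Stream Processing Key {}, Value {}", key, value) }
    .transform({ TimestampTransformer() })
    .peek { key, value -> logger.info("Changed Timestamp Key {}, Value {}", key, value) }
```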

Now we place these messages onto the compacted topic.

```shell
clark:perks
clark:matthews
clark:stevens
sarah:news
sarah:car
clark:party
```

We start up the application and we can see it working correctly. We only join on the latest value, so by comparing timestamps we can ensure we only use the latest value for the key when we join against the KTable, and we could filter older data out of the stream.

```shell
Stream Processing Key sarah, Value news
Stream Processing Key sarah, Value car
Stream Processing Key clark, Value perks
Stream Processing Key clark, Value matthews
Stream Processing Key clark, Value stevens
Stream Processing Key clark, Value party
Joining the Stream Name news to the KTable Name car
Changed Timestamp Key sarah, Value car
Joining the Stream Name car to the KTable Name car
Changed Timestamp Key clark, Value perks
Joining the Stream Name perks to the KTable Name party
Changed Timestamp Key clark, Value matthews
Joining the Stream Name matthews to the KTable Name party
Changed Timestamp Key clark, Value stevens
Joining the Stream Name stevens to the KTable Name party
Changed Timestamp Key clark, Value party
Joining the Stream Name party to the KTable Name party
```
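
As the last sentence above suggests, older stream-side data could also be filtered out around the join. That is not done in this example (the values are plain strings), but a rough sketch, assuming the event time were carried inside the value, could look like this:

```kotlin
// Hypothetical value type; a real topology would also need a serde for it.
data class TimedName(val name: String, val eventTimeMs: Long)

// Keep only stream records that are at least as new as the table's current value for the key.
fun joinLatestOnly(
    stream: KStream<String, TimedName>,
    table: KTable<String, TimedName>
): KStream<String, TimedName> =
    stream
        .leftJoin(table, ValueJoiner<TimedName, TimedName?, TimedName?> { streamValue, tableValue ->
            if (tableValue == null || streamValue.eventTimeMs >= tableValue.eventTimeMs) streamValue else null
        })
        .filterNot { _, value -> value == null }
        .mapValues { value -> value!! }
```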

Now if we add back in the rekey example and run this data we get the following.

```shell
mark:sarah1
sarah:mark2
sarah:mark3
mark:sarah2
```

```shell
Stream Processing Key sarah, Value mark1
Changed Timestamp Key sarah, Value mark1
Stream Processing Key mark, Value sarah1
Changed Timestamp Key mark, Value sarah1
Stream Processing Key sarah, Value mark2
Changed Timestamp Key sarah, Value mark2
Stream Processing Key sarah, Value mark3
Changed Timestamp Key sarah, Value mark3
Stream Processing Key mark, Value sarah2
Changed Timestamp Key mark, Value sarah2
Joining the Stream Name mark1 to the KTable Name sarah2
Joining the Stream Name sarah1 to the KTable Name mark3
Joining the Stream Name mark2 to the KTable Name sarah2
Joining the Stream Name mark3 to the KTable Name sarah2
Joining the Stream Name sarah2 to the KTable Name mark3
```

## Testcontainers Integration Tests

Requires Docker to be running.

kotlin-kafka-streams-examples/src/main/kotlin/com/perkss/kafka/reactive/AppConfig.kt

Lines changed: 6 additions & 1 deletion
@@ -36,8 +36,13 @@ class AppConfig {
         streamsConfiguration[KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG] = props.schemaRegistry
         streamsConfiguration[StreamsConfig.STATE_DIR_CONFIG] = props.stateDir
         streamsConfiguration[ConsumerConfig.AUTO_OFFSET_RESET_CONFIG] = "earliest"
-        streamsConfiguration[StreamsConfig.TOPOLOGY_OPTIMIZATION] =
+        streamsConfiguration[StreamsConfig.TOPOLOGY_OPTIMIZATION_CONFIG] =
             StreamsConfig.OPTIMIZE // do not create internal changelog, have to have source topic as compact https://stackoverflow.com/questions/57164133/kafka-stream-topology-optimization
+        // disable cache for testing
+        streamsConfiguration[StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG] = 0
+        streamsConfiguration[StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG] = Serdes.String()::class.java
+        streamsConfiguration[StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG] = Serdes.String()::class.java
+
         return streamsConfiguration
     }
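
For context, these properties feed straight into the `KafkaStreams` instance. A minimal sketch of the wiring, assuming the usual build-and-start flow (the actual Spring setup in this project is outside this diff):

```kotlin
// Sketch only: build the topology against a StreamsBuilder and start the app with the properties above.
val builder = StreamsBuilder()
val topology = BootstrapSemanticsSelfJoinTopology.build(builder, streamsConfiguration)
val streams = KafkaStreams(topology, streamsConfiguration)
streams.start()
```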

kotlin-kafka-streams-examples/src/main/kotlin/com/perkss/kafka/reactive/BootstrapSemanticsSelfJoinTopology.kt

Lines changed: 38 additions & 6 deletions
@@ -1,28 +1,60 @@
 package com.perkss.kafka.reactive

 import org.apache.kafka.common.serialization.Serdes
+import org.apache.kafka.streams.KeyValue
 import org.apache.kafka.streams.StreamsBuilder
 import org.apache.kafka.streams.Topology
-import org.apache.kafka.streams.kstream.Consumed
-import org.apache.kafka.streams.kstream.Joined
-import org.apache.kafka.streams.kstream.Produced
-import org.apache.kafka.streams.kstream.ValueJoiner
+import org.apache.kafka.streams.kstream.*
+import org.apache.kafka.streams.processor.ProcessorContext
+import org.apache.kafka.streams.processor.To
 import org.slf4j.LoggerFactory
+import java.time.LocalDateTime
+import java.time.ZoneOffset
 import java.util.*

+class TimestampTransformer : Transformer<String, String, KeyValue<String, String>?> {
+
+    private lateinit var context: ProcessorContext
+    override fun init(context: ProcessorContext) {
+        this.context = context
+    }
+
+    override fun close() {
+    }
+
+    override fun transform(key: String?, value: String?): KeyValue<String, String>? {
+        // In reality use the timestamp on the event
+        context.forward(
+            key,
+            value,
+            To.all().withTimestamp(LocalDateTime.now().toInstant(ZoneOffset.UTC).toEpochMilli())
+        )
+        return null
+    }
+}
+
+
 // Results in consuming only the latest message from the Topic
 object BootstrapSemanticsSelfJoinTopology {

     private val logger = LoggerFactory.getLogger(BootstrapSemanticsSelfJoinTopology::class.java)

     fun build(streamsBuilder: StreamsBuilder, properties: Properties): Topology {
         val nameKTable = streamsBuilder
-            .table("name", Consumed.with(Serdes.String(), Serdes.String()))
+            .table(
+                "name",
+                Consumed.with(Serdes.String(), Serdes.String()).withTimestampExtractor(IgnoreTimestampExtractor())
+            )

         nameKTable
             .toStream()
             .peek { key, value ->
-                logger.info("Processing {}, {}", key, value)
+                logger.info("Stream Processing Key {}, Value {}", key, value)
+            }
+            .transform(
+                { TimestampTransformer() })
+            .peek { key, value ->
+                logger.info("Changed Timestamp Key {}, Value {}", key, value)
             }
             .selectKey { key, value ->
                 val re = Regex("[^A-Za-z]")

kotlin-kafka-streams-examples/src/main/kotlin/com/perkss/kafka/reactive/IgnoreTimestampExtractor.kt

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
+package com.perkss.kafka.reactive
+
+import org.apache.kafka.clients.consumer.ConsumerRecord
+import org.apache.kafka.streams.processor.TimestampExtractor
+
+class IgnoreTimestampExtractor : TimestampExtractor {
+    override fun extract(record: ConsumerRecord<Any, Any>?, partitionTime: Long): Long {
+        // Ignore the timestamp to enable a KTable to act like a global KTable and bootstrap fully
+        // before processing it against joins.
+        return 0
+    }
+}
