Kafka, Sadly It's Time To Part Ways

I had big dreams for the perfect union between my company and Kafka.  I could see jagigabytes (technical term for a huge number) upon jagigabytes of data passing through our network from the massive clickstream data that we would produce.  The power of keeping our data in-house, instead of relying on paid services to store and cull it, was huge.

That was the dream; now for the reality :(.  We tried to bend Kafka to our use case, but Kafka didn't break.  I badly wanted the pub/sub application to work at our small scale.  When I say our scale, I mean somewhere south of 1,000 messages per day for business transaction purposes.

My thinking was that if we could get it to work at our scale, then we would have learned a great deal to help us with my grander vision.  I can say that I achieved the goal of learning, but not much more.

The first issue we hit was that messages were not always returned from the queue on a single fetch request.  I saw this during development, but I didn't pay enough attention to what I was seeing.  That turned out to be a fatal flaw.

We were losing messages

When we configured our jobs to read from various topics, we set them to poll at specific intervals.  When we spaced those polls out to an hour or more, we were narrowing the window between the retention policy and our opportunities to read the data.  For example, with a retention policy of 16 hours and a poll interval of one hour, we get 16 chances to read a given message.  If data was not returned during those 16 individual read attempts, it was lost.

What happened is that we were missing critical event data and couldn't figure out why.  It took some time before I realized that you have to keep asking for the data until it is returned.  That was issue number one.
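For anyone hitting the same thing, here is roughly what the fix looked like.  This is a minimal sketch against the Confluent REST Proxy's v1 consumer API using Python's requests library; the proxy URL, group, and topic names are placeholders, and the exact content types may differ depending on your proxy version.

```python
import time
import requests

PROXY = "http://localhost:8082"    # REST Proxy address (placeholder)
GROUP = "billing-jobs"             # placeholder consumer group
TOPIC = "business-transactions"    # placeholder topic

# Register a consumer instance in the group.
create = requests.post(
    f"{PROXY}/consumers/{GROUP}",
    json={"format": "json", "auto.offset.reset": "smallest"},
    headers={"Content-Type": "application/vnd.kafka.v1+json"},
)
base_uri = create.json()["base_uri"]

# A single GET can legitimately come back empty even when messages exist,
# so keep asking until the proxy actually hands records back (or we give up).
def fetch_until_data(max_attempts=10, delay_secs=2):
    for _ in range(max_attempts):
        resp = requests.get(
            f"{base_uri}/topics/{TOPIC}",
            headers={"Accept": "application/vnd.kafka.json.v1+json"},
        )
        records = resp.json()
        if records:
            return records
        time.sleep(delay_secs)
    return []

for record in fetch_until_data():
    print(record["value"])
```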

We were losing messages

Now that we were able to get data back, all of a sudden all of the data was gone.  This was really baffling!  I thought we had solved our problems with receiving data, but from the outside it looked as if we were having the same issue again.  I couldn't figure out why, after 16 hours, our queue was empty regardless of how recently the last message had been published.

I did all the reading that someone should have to do in a lifetime (except for you, please continue reading) and I couldn't solve it.  So I turned to the Kafka mailing list for help.  It turns out that Kafka deletes the whole log segment file containing a message that falls outside the retention policy, not just that message.  This was exactly what we were seeing.

We could send a steady stream of data and, like clockwork, all of it would be gone once the retention cleanup kicked in.  It turns out that the default log segment size is a gigabyte.  Remember, my volumes are very low and we wouldn't fill that up in a year.  That could be solved by setting the segment size really low; we set it to 1024 bytes.
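For reference, the topic-level overrides look roughly like the sketch below, shown with kafka-python's admin client purely as one way to apply them; the broker address and topic name are placeholders, and on brokers as old as the ones we were running you would make the same change through Kafka's own config tooling instead.

```python
from kafka.admin import KafkaAdminClient, ConfigResource, ConfigResourceType

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")  # placeholder broker

# Cap each log segment at 1 KiB so that, at tiny volumes, segments actually
# roll and age out individually instead of one giant 1 GB segment holding
# every message hostage until the whole thing is deleted.
admin.alter_configs([
    ConfigResource(
        ConfigResourceType.TOPIC,
        "business-transactions",                 # placeholder topic
        configs={
            "segment.bytes": "1024",             # tiny segments
            "retention.ms": str(16 * 60 * 60 * 1000),  # 16 hours, as above
        },
    )
])
```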

We were losing messages

That brings us to our third and last issue.  The straw that broke the camel's back.  The nail in the coffin.  Ok, I will stop.  Now that we were receiving data reliably and our log files were right-sized, what else could be going on?

With their REST client, there are two ways of committing an offset back when operating in a group.  You can auto-commit, where the cursor is advanced to the last entry that was returned, or you can wait and commit that cursor position once you are done with the data.  To be fair, we had some issues in our own code that caused processing to halt partway through a batch.  Those messages were already committed, but never processed.
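To make the two options concrete, here is a rough sketch of the manual-commit flow through the REST Proxy (v1 API via Python requests; the proxy URL, group, topic, and process() are placeholders).  Note that the fetch still returns however many records the proxy decides to hand back, which is where the next problem comes in.

```python
import requests

PROXY = "http://localhost:8082"    # REST Proxy address (placeholder)
GROUP = "billing-jobs"             # placeholder consumer group
TOPIC = "business-transactions"    # placeholder topic

def process(value):
    """Stand-in for our real business logic."""
    print(value)

# Create the consumer with auto-commit turned off, so the cursor only
# moves when we say it should.
create = requests.post(
    f"{PROXY}/consumers/{GROUP}",
    json={"format": "json",
          "auto.offset.reset": "smallest",
          "auto.commit.enable": "false"},
    headers={"Content-Type": "application/vnd.kafka.v1+json"},
)
base_uri = create.json()["base_uri"]

# Fetch a batch; the proxy decides how many records come back.
records = requests.get(
    f"{base_uri}/topics/{TOPIC}",
    headers={"Accept": "application/vnd.kafka.json.v1+json"},
).json()

for record in records:
    process(record["value"])

# Commit only after the whole batch is processed.  If process() blows up
# above, nothing is committed and the batch is redelivered rather than
# silently skipped.
requests.post(
    f"{base_uri}/offsets",
    headers={"Content-Type": "application/vnd.kafka.v1+json"},
)
```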

Without the ability to grab a single message at a time, we were stuck.  We had hoped that Confluent 3.0 (Kafka 0.10) was going to save the day with max.poll.records, but that setting didn't make it into the REST client.  Disappointed, we realized that we had really hit a wall.

We sucked it up and decided to turn our backs on Kafka for now.  We were diligent about creating abstractions that will let us switch with reasonable ease.  We will be taking a day to research and design what the new solution will be.  I think this was a good lesson in picking a solution that matches the current use case.  Even though I really wanted to set us up to use Kafka for my grander vision, it just wasn't the right choice.
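The abstraction itself is nothing fancy.  A minimal sketch of the shape it takes is below; the names are illustrative, not our actual code.

```python
import abc

class MessageQueue(abc.ABC):
    """The interface our jobs code against, rather than any one broker's client."""

    @abc.abstractmethod
    def publish(self, topic: str, message: dict) -> None:
        """Send a message to a topic."""

    @abc.abstractmethod
    def fetch(self, topic: str, max_messages: int = 1) -> list:
        """Return up to max_messages; callers must tolerate an empty list."""

    @abc.abstractmethod
    def ack(self, topic: str) -> None:
        """Commit progress only after the caller has finished processing."""


class KafkaRestQueue(MessageQueue):
    """The Kafka REST Proxy implementation we are retiring."""
    # publish/fetch/ack would wrap the REST Proxy calls shown earlier.


class NextThingQueue(MessageQueue):
    """Placeholder for whatever the research day turns up."""
```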

I haven't turned my back on Kafka completely; I still think it is awesome and it will have a home with us in the future.  Sadly, for now you're not my size, so I will have to leave you on the rack.  Goodbye.