Kafka, Sadly It's Time To Part Ways

I had big dreams for the perfect union between my company and Kafka. I could see jagigabytes (technical term for a huge number) upon jagigabytes of data passing through our network from the massive clickstream data we would produce. The power of having our data in-house, rather than relying on paid services to store and cull it, was huge.

That was the dream; and now for the reality :(. We tried to bend Kafka to our will, but Kafka wouldn't bend. I badly wanted the pub/sub application to work at our small scale. When I say our scale, I mean somewhere south of 1,000 messages per day for business-transaction purposes.

My thinking was that if we could get it to work at our scale, then we would have learned a great deal to help us with my grander vision.  I can say that I achieved the goal of learning, but not much more.

The first issue we had was that messages were not always returned from the queue in a single fetch request. I saw that during development, but I didn't pay enough attention to what I was seeing. That turned out to be a fatal flaw.

We were losing messages

When we configured our jobs to read from various topics, we set them to poll at specific intervals. When we spaced the polls out to an hour or more, we were closing the window between the retention policy and our opportunities to read the data. For example, with a retention policy of 16 hours and a poll interval of one hour, we had 16 chances to read the data. If data was not returned during those 16 individual read attempts, it was lost.

What happened is that we were missing critical event data and couldn't figure out why. It took some time before I figured out that you have to keep asking for the data until it is returned. That was issue number one.
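In hindsight, the fix is simple to state: treat an empty fetch as "ask again", not as "nothing there". Here is a minimal sketch of that retry loop in C#, where fetchBatchAsync is a hypothetical stand-in for whatever call your consumer actually makes (for us, a request through the Confluent REST proxy):

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    // Sketch only: fetchBatchAsync is a hypothetical stand-in for whatever call
    // actually pulls records. The lesson it encodes: one empty response does not
    // mean the topic is empty, so keep asking until records come back or you
    // decide to give up.
    static class ConsumerRetry
    {
        public static async Task<IReadOnlyList<string>> FetchUntilDataAsync(
            Func<Task<IReadOnlyList<string>>> fetchBatchAsync,
            TimeSpan giveUpAfter)
        {
            var deadline = DateTime.UtcNow + giveUpAfter;
            while (DateTime.UtcNow < deadline)
            {
                var records = await fetchBatchAsync();
                if (records.Count > 0)
                    return records;                        // finally got data back

                await Task.Delay(TimeSpan.FromSeconds(5)); // empty fetch: ask again
            }
            return Array.Empty<string>();                  // nothing before the deadline
        }
    }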

We were losing messages

Now that we were able to get the data back, all of a sudden all of it was gone. This was really baffling! I thought we had solved our problems with receiving data, but from the outside it looked as if we were having the same issue again. I couldn't figure out why, after 16 hours, our queue was empty regardless of how recently the last message had been published.

I did all the reading that anyone should have to do in a lifetime (except for you, please continue reading) and I couldn't solve it, so I turned to the Kafka mailing list for help. It turns out that Kafka deletes the entire log segment file containing a message that falls outside the retention policy. This was exactly what we were seeing.

We could send a steady stream of data and, like clockwork, all of it would be gone once the cleanup kicked in. It turns out that the default log segment file is a gigabyte in size. Remember, our volumes are very low and we wouldn't fill that up in a year. That could be solved by setting the segment file size really low; we set it to 1024 bytes.
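For reference, the knobs involved are broker-side settings. Roughly what we ended up with looks like this (property names as I recall them, so verify against the docs for your Kafka version):

    # server.properties -- names as I recall them, verify for your version
    log.retention.hours=16      # the 16-hour retention window from the example above
    log.segment.bytes=1024      # roll a new segment at ~1 KB instead of the ~1 GB default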

We were losing messages

That brings us to our third and last issue. The straw that broke the camel's back. The nail in the coffin. Okay, I will stop. Now that we were receiving data reliably and our log files were right-sized, what else could be going on?

With the REST client, there are two ways of committing an offset back when operating in a consumer group. You can auto-commit, which sets your cursor to the last entry that was returned, or you can wait and commit that cursor position once you are done with the data. To be fair, we had some issues in our own code that caused processing to halt and stop working through messages. Those messages had already been committed, but were never processed.
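The difference matters because of where a crash leaves you. Here is a sketch of the commit-after-processing pattern, with fetchBatchAsync, processAsync, and commitOffsetsAsync as hypothetical stand-ins for your client's calls rather than the actual REST proxy API:

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    // Sketch only: the three delegates stand in for whatever your client exposes.
    // With auto-commit the cursor advances as soon as a batch is returned, so a
    // failure while processing loses those messages. Committing only after the
    // work succeeds keeps unprocessed messages replayable on the next poll.
    static class ConsumerLoop
    {
        public static async Task RunAsync(
            Func<Task<IReadOnlyList<string>>> fetchBatchAsync,
            Func<string, Task> processAsync,
            Func<Task> commitOffsetsAsync)
        {
            while (true)
            {
                var batch = await fetchBatchAsync();
                foreach (var message in batch)
                    await processAsync(message);       // do the work first...

                if (batch.Count > 0)
                    await commitOffsetsAsync();        // ...then move the cursor
            }
        }
    }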

Without the ability to grab a single message at a time, we were stuck. We had hoped that Confluent 3.0 (Kafka 0.10) was going to save the day with max.poll.records, but that setting didn't make it into the REST client. Disappointed, we realized that we had really hit a wall.

We sucked it up and decided to turn our backs on Kafka for now. We were diligent about creating abstractions that will allow us to change with reasonable ease, and we will be taking a day to research and design what the new solution will be. I think this was a good lesson in picking a solution that matches the current use case. Even though I really wanted to set us up to use Kafka for my grander vision, it just wasn't the right choice.

I haven't turned my back on Kafka completely; I still think it is awesome, and it will have a home with us in the future. Sadly, for now I can't fit your size, so I will have to leave you on the rack. Goodbye.


Kafka Client for .NET

They say that every journey starts with a single step; well, even that can get screwed up. I am going to attempt to write a comprehensive client for Kafka in my native tongue, C#, and I want to document the progress.

And now for the misstep… I have been working on this client for several days now and making a great deal of progress, and I felt this evening was a good time to add my work to GitHub. It turns out that when Xamarin says the folder is not empty and asks if I want it deleted, it means it. I didn't notice that the folder I had put in the box was the root of my projects directory. Luckily this is a fairly new laptop, so I only had a single project.

The documentation that I am using as a reference guide is located here: https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Requests.

Some of my first struggles were due to my lack of familiarity with the Backus-Naur syntax. Another nuance that I glossed over was the default serialization that Kafka expects. It is obvious now, but I had missed some of the requirements around variable-width fields such as strings. The protocol specifies that an array must be preceded by a 32-bit count of its elements; for instance, for an array of 32 fields you write 32 to the stream and then the contents of the array. The same pattern holds for string fields, except they are preceded by a 16-bit length.
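To make that concrete, here is a rough sketch of the helpers I am writing for those prefixes. The names are my own, not anything from the protocol guide, and the integer writers also handle the byte-order point described next:

    using System;
    using System.IO;
    using System.Text;

    // Sketch of the length-prefix rule as I read it: a string is written as a
    // 16-bit length followed by its bytes, and an array as a 32-bit element count
    // followed by the elements. Helper names are mine, not from the guide.
    static class KafkaEncoding
    {
        public static void WriteString(Stream stream, string value)
        {
            var bytes = Encoding.UTF8.GetBytes(value);
            WriteInt16(stream, (short)bytes.Length);    // 16-bit length prefix
            stream.Write(bytes, 0, bytes.Length);
        }

        public static void WriteStringArray(Stream stream, string[] values)
        {
            WriteInt32(stream, values.Length);          // 32-bit element count
            foreach (var v in values)
                WriteString(stream, v);
        }

        public static void WriteInt16(Stream stream, short value)
        {
            var b = BitConverter.GetBytes(value);
            if (BitConverter.IsLittleEndian) Array.Reverse(b); // wire format is big-endian
            stream.Write(b, 0, b.Length);
        }

        public static void WriteInt32(Stream stream, int value)
        {
            var b = BitConverter.GetBytes(value);
            if (BitConverter.IsLittleEndian) Array.Reverse(b);
            stream.Write(b, 0, b.Length);
        }
    }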

Another big thing that I luckily picked up on early was that the protocol is big-endian, so on a little-endian machine I had to reverse all of the bytes. I am going to start the journey again tomorrow, on the Metadata API. Here is the git repo if you want to follow along.

https://github.com/hivie7510/KafkaSharp