Kafka, Sadly It's Time To Part Ways

I had big dreams for the perfect union between my company and Kafka.  I could see jagigabytes (technical term for a huge number) upon jagigabytes of clickstream data passing through our network.  The power of having that data in-house, instead of relying on paid services to store and cull it, was huge.

That was the dream; and now for the reality :(.  We tried to bend Kafka to our will to meet our use case, but Kafka didn't break.  I badly wanted the pub/sub application to work at our small scale.  When I say our scale, I mean somewhere south of 1,000 messages per day for business transaction purposes.

My thinking was that if we could get it to work at our scale, then we would have learned a great deal to help us with my grander vision.  I can say that I achieved the goal of learning, but not much more.

The first issue we had was that messages were not always returned from the queue during a single fetch request.  I saw that during development, but I didn't pay enough attention to what I was seeing.  That turned out to be a fatal flaw.

We were losing messages

When we configured our jobs to read from various topics, we configured them to poll at specific intervals.  When we spaced the polls out to an hour or more, we narrowed the window between the retention policy and the opportunities to read the data.  For example, with a retention policy of 16 hours and a poll interval of one hour, we get 16 chances to read the data.  If data was not returned during those 16 individual read attempts, it was lost.

What happened is that we were missing critical event data and we couldn't figure out why.  It took some time before I figured out that you have to keep asking for the data until it is returned.  That was issue number one.
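Here is roughly what the fix looked like in spirit: a minimal sketch in C# of the "keep asking until it comes back" loop.  It assumes a consumer instance has already been created through Confluent's REST proxy and that we read from its v1 endpoint (GET /consumers/{group}/instances/{instance}/topics/{topic}); the helper name, URL, and retry numbers are illustrative rather than our production code.

using System;
using System.Net.Http;
using System.Threading.Tasks;

// Keep fetching until the proxy actually hands records back; a single empty
// fetch does not mean the topic is empty.
static async Task<string> FetchUntilDataAsync(HttpClient http, string readUrl, int maxAttempts = 10)
{
    for (var attempt = 0; attempt < maxAttempts; attempt++)
    {
        var response = await http.GetAsync(readUrl);
        response.EnsureSuccessStatusCode();

        var body = await response.Content.ReadAsStringAsync();
        if (body != "[]")          // an empty JSON array means "nothing this time"
            return body;

        await Task.Delay(TimeSpan.FromSeconds(5));   // back off and ask again
    }

    return "[]";                   // still nothing after maxAttempts fetches
}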

We were losing messages

Now that we were able to get the data back, all of a sudden all of the data was gone.  This was really baffling!  I thought we had solved our problems with receiving the data, but from the outside it looked as if we were having the same issue again.  I couldn't figure out why, after 16 hours, our queue was empty regardless of how recently the last message had been published.

I did all the reading that someone should have to do in a lifetime (except for you, please continue reading) and I couldn't solve it.  So I turned to the Kafka mailing list for help.  It turns out that Kafka deletes the entire log file containing a message that has fallen outside of the retention policy.  This was exactly what we were seeing.

We could send a steady stream of data and, like clockwork, all of it would be gone once the cleanup ran.  It turns out that the default log segment file is a gigabyte in size.  Remember, my volumes are very low and we wouldn't fill that up in a year.  That could be solved by setting the log file size really low; we set it to 1024 bytes.
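For reference, the two broker settings in play look roughly like this in server.properties (values are illustrative of what we ended up with, not a recommendation):

# server.properties (illustrative values)
log.retention.hours=16     # how long messages are retained
log.segment.bytes=1024     # roll log segments at 1 KB instead of the 1 GB default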

We were losing messages

That brings us to our third and last issue.  The straw that broke the camel's back.  The nail in the coffin.  OK, I will stop.  Now that we were receiving data reliably and our log files were right-sized, what else could be going on?

With their REST client there are two methods of committing an offset back when operating in a consumer group.  You can auto-commit, where the cursor is set to the last entry that was returned, or you can wait and commit that cursor position once you are done with the data.  To be fair, we had some issues in our code that were causing processing to halt and stop consuming messages.  These were messages that had already been committed but had not been processed.
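A rough sketch of the second approach (commit only once the work is done), reusing the FetchUntilDataAsync helper and the http/readUrl variables from the earlier sketch: it assumes the consumer instance was created with auto-commit disabled and that the REST proxy's v1 offsets endpoint commits everything handed to the instance so far.  ProcessRecords and the URL are placeholders, not code from our actual jobs.

// Sketch only: fetch, process, and only then commit the offsets back.
// If ProcessRecords throws, nothing is committed and the records can be re-read.
var records = await FetchUntilDataAsync(http, readUrl);

ProcessRecords(records);   // hypothetical business logic

// v1 REST proxy: POST .../offsets commits the offsets of everything this
// consumer instance has been handed so far.
var commitUrl = "http://rest-proxy:8082/consumers/my_group/instances/my_instance/offsets";
var commitResponse = await http.PostAsync(commitUrl, new StringContent(string.Empty));
commitResponse.EnsureSuccessStatusCode();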

Without the ability to grab a single message at a time, we were stuck.  We had hoped that Confluent 3.0 (Kafka 0.10) was going to save the day with max.poll.records, but they didn't roll that into the REST client.  Disappointed, we realized that we had really hit a wall.

We sucked it up and decided to turn our backs on Kafka for now.  We were diligent about creating abstractions that will allow us to change with reasonable ease.  We will be taking a day to research and design what the new solution will be.  I think this was a good lesson in picking a solution that matches the current use case.  Even though I really wanted to set us up to use Kafka for my grander vision, it just wasn't the right choice.

I haven't turned my back on Kafka completely; I still think it is awesome and it will have a home with us in the future.  Sadly, for now I can't fit your size, so I will have to leave you on the rack.  Goodbye.

 

 


Self-Organized Architecture

How does an agile developer know which frameworks can be used?  How does the product team know which open source licenses can be used in their products?  In a multi-team environment, who coordinates the architecture to ensure that it is sustainable?  These are the questions that I want to examine.

One of the tenets of the Agile Manifesto states that architecture is best crafted by the product team.

"The best architectures, requirements, and designs emerge from self-organizing teams." – Agile Manifesto

The framework du jour is whatever is hot today, the new shiny object.  I have found the practice to be most prevalent in web applications and JavaScript libraries.  There is something to be said for defining boundaries around it.  If you have not seen the framework du jour in action, please trust me: I have witnessed, and taken part in, this practice.

What are the suitability parameters that aid in the selection process when choosing a supporting framework?  Is the longevity of the project sufficient?  Can the open source software (OSS) license model be adhered to?  Is the library actually supported by an organization or by an individual?  These are important questions that should be asked when the product team wants to augment their code with external projects.

Some oversight should be in place to make sure that the products and their dependencies are built on a solid foundation.  One of the gotchas present in most projects is their use of OSS.  Who is there to monitor the licenses being used and ensure compliance?  I do not see this as any different from making sure that the appropriate Microsoft licenses are being maintained.

Open source license violations are not monitored like those of Microsoft's software.  For the majority of OSS users, their use will go unseen by the community as well as by any corporate oversight.  That does not make it OK, but without oversight the likelihood of violation is greater.

The manifesto quote above speaks about the architecture spawning from within the product team.  This makes sense to me when you look at a team that is chartered to build a single product.  Does the statement still hold true when you span several products and their teams?  If they all stay within their boundaries, it still makes sense.

Who designs the architecture when the product teams must integrate with each other?  That is a question I have been trying to understand for quite some time.  There is nothing that precludes the product teams from building an excellent integration for their products, but how do they match the long-term IT plan?

One of the things that I see and read about is how extreme agile practices are in use at Spotify or Facebook.  These are awesome examples of how teams can iterate over a product and release amazing software very quickly.  Can this be applied in a corporate environment where you are building tools to support internal business users?  I don't know the answer, but I believe that you cannot do a 1:1 comparison of an internal business application with a social product where there is little or no regulation.  Business applications also typically have uptime requirements and other SLAs, and a down system sings a much different song.

I believe that product teams should be equipped with the parameters so that they can pick the best-fit frameworks to help them deliver awesome software.  It makes sense, since they are closest to the problem.  Open source licensing should still be monitored, because that is just being a good citizen.

Building an infrastructure to support organizational growth is a different story.  I think there needs to be some central body, whether an architect or not, to look at the future of the portfolio and make sure that the products are driving in that direction.  I would love to hear others' opinions on this topic.  Cheers

First Chance Exception Settings, Your Friend

“I am trying to debug, but this stupid exception keeps happening”.  I have seen that situation play out many times over the years.  You are trying to exercise your code, but the same spot keeps throwing an exception.  This doesn’t impact your functionality, but it’s an annoyance.

Sometimes it takes a while for the frustration to hit a peak, but when it does you are beside yourself.  There is functionality in Visual Studio to help you out, but before we get to that, let's talk philosophy.

There are many schools of thought with regard to developing code.  On one side, if your tests (unit, integration, etc.) are not passing, then you cannot check in.  This would imply that if the code base you are working on is throwing exceptions, then you cannot check in until it is fixed.  Under this philosophy, everything stops until the code is in good working order.

On the other side, you have blinders on and you only focus on getting your own code to work.  In this scenario it is OK to skip over some exceptions, because they are someone else's problem.  I don't have to preach about the issue with "someone else's problem," but that attitude does not help your team.

The scenario that I left out is less philosophy and more practice.  Sometimes there are bugs in the code that are OK to leave unfixed.  This could be for many reasons, like priority or planned obsolescence.

I always recommend that my team members use a little-known set of toggles in Visual Studio called Exception Settings.  Exception Settings allow you to have a first crack at code (the "first chance") that is about to throw an exception.

[Screenshot: the Exception Settings window in Visual Studio]

Without this functionality turned on, the code will throw and you may not be able to proceed.  Breaking on first chance exceptions allows you to move the instruction pointer to a different line that will function normally and continue.  This is very powerful when you are trying to figure out where an exception is coming from and the state of the system (threads, call stack, variables, etc.) when it happens.
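If you want to see what "first chance" actually means outside of the debugger, .NET raises an event at that same moment, before any catch block runs.  The little console sketch below only illustrates the timing; it is not something the Exception Settings window requires:

using System;
using System.Runtime.ExceptionServices;

class FirstChanceDemo
{
    static void Main()
    {
        // Fires at the "first chance": the instant an exception is thrown,
        // before any handler has had a chance to catch it.
        AppDomain.CurrentDomain.FirstChanceException +=
            (sender, e) => Console.WriteLine("First chance: " + e.Exception.GetType().Name);

        try
        {
            throw new InvalidOperationException("expected, and handled below");
        }
        catch (InvalidOperationException)
        {
            Console.WriteLine("Handled; notice the first-chance message already printed.");
        }
    }
}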

I have been in situations where it was OK, and even expected, for code to throw, but I don't want to have to view it each time.  In this case, Visual Studio allows you to deselect the exceptions that you don't care about.  You can see below that I have chosen to break on all exceptions except for Microsoft.JScript.JScriptException.

[Screenshot: CLR exceptions in Exception Settings, with Microsoft.JScript.JScriptException deselected]

To use this functionality you have to set these toggles before starting your application.  Consider this: you are running a long process and you get interrupted by an exception that you did not expect to see, but that you do not actually care about.  In that case, you have the option to turn off first chance breaking for that exception type for all future executions.  You need only deselect the "Break when this exception type is thrown" checkbox.

[Screenshot: deselecting "Break when this exception type is thrown" for an exception type]

In the case above, the exception is expected as part of the SSPI authentication process, so I can ignore it.

It doesn’t matter which school of thought you or your team subscribes to, but the ability to toggle first chance exceptions is an important hammer to have in your toolbox.

Sidebar: I can't recommend turning on this functionality enough.  It has helped me catch a lot of bugs before they got into the wild.  Hope this helps, cheers.

SSH.NET & echo | sudo -S

This is an extension of yesterday's post about getting the test harness to connect to vSphere.  I am going to show how to use SSH.NET to run commands on a server, and more specifically how to make sudo calls remotely.

The goal for today was to take the fresh new VM server and install and configure Kafka.  Sounds easy, right?  I use SSH.NET to upload all of the files that I need, as well as a shell script to orchestrate the operations.  We are using systemd (managed with systemctl) as the daemon manager, so we need the *.service files to be in the right directory.

/usr/lib/systemd/system
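Getting the files onto the box in the first place is the easy part with SSH.NET's SftpClient; a quick sketch (host, credentials, and paths are placeholders) looks like this:

// Sketch: upload the install script (and similarly the *.service files and the
// Kafka tarball) to the user's home directory before doing anything privileged.
using (var sftp = new Renci.SshNet.SftpClient("my_host_server", "my_user_name", "my_password"))
{
    sftp.Connect();

    using (var file = System.IO.File.OpenRead(@"C:\deploy\install_and_configure.sh"))
    {
        sftp.UploadFile(file, "/home/my_user_name/scripts/install_and_configure.sh");
    }

    sftp.Disconnect();
}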

There is one snag: you need sudo, because that directory is protected.  Looking at some of the options for sudo, I came across the -S flag, which allows sudo to take the password from standard input.

echo "my_password" | sudo -S mv myfile /protected_directory/

This takes the password and pipes it into the sudo move command.  I tried a bunch of ways to make this work from code.

using (var client = new SshClient("my_host_server", "my_user_name", "my_password"))
{
    client.Connect();

    // Pipe the password into sudo via -S and run the install script.
    var command =
        client.CreateCommand(
            "echo 'my_password' | sudo -S sh /home/my_user_name/scripts/install_and_configure.sh");

    var output = command.Execute();
    client.Disconnect();
}

This command works great on the server, but it doesn't work in the SshClient's session.  I tried a bunch of variations of the code above, but none of them worked.  It was pretty frustrating.  If you take a peek at the output of the Execute() method, you will see the message "sorry, you need a tty to run sudo."  This is a significant clue.  A tty, or terminal emulator, is what you would be using if you created an SSH session with PuTTY.  When we are using SSH.NET like we are above, we do not have a virtual terminal running.  That is what the error is telling us.  I started to piece together what might look like a solution after googling the interwebs.


We need to somehow emulate a tty terminal using the library.

The first thing we need to do is create a new client and connect a session.

SshClient client = new SshClient(server_address, 22, login, password);
client.Connect();

Next we need to create the terminal that will be used by the ShellStream.  First we have to create a dictionary of the terminal modes that we want to enable.

IDictionary<Renci.SshNet.Common.TerminalModes, uint> modes =
    new Dictionary<Renci.SshNet.Common.TerminalModes, uint>();
modes.Add(Renci.SshNet.Common.TerminalModes.ECHO, 53);

Adding the ECHO terminal mode enables echo in the session.  Now we need to create our ShellStream, specifying the terminal emulator that we want, the dimensions for the terminal, and lastly the modes that we want to enable.

ShellStream shellStream =
    client.CreateShellStream("xterm", 80, 24, 800, 600, 1024, modes);

Now that we have a terminal emulator to work with, we can start sending commands.  There are three commands that we need and they each will have a response that we expect.

  1. Login
  2. Send our sudo command
  3. Send the password

We have already created our session and logged in, so there should be output waiting for us.  After we send our command, we should expect the password prompt.  Once we know that we are in the right spot, we can send the password.

var output = shellStream.Expect(new Regex(@"[$>]")); 

shellStream.WriteLine("sudo sh /home/my_user_name/scripts/install_and_configure.sh"); 
output = shellStream.Expect(new Regex(@"([$#>:])"));
shellStream.WriteLine(password);

At this point, we should have executed our install_and_configure.sh script successfully.  Putting it all together:

using System.Collections.Generic;
using System.Text.RegularExpressions;
using Renci.SshNet;

SshClient client = new SshClient(server_address, 22, login, password);
client.Connect();

// Request a pseudo-terminal with echo enabled so sudo will accept input.
IDictionary<Renci.SshNet.Common.TerminalModes, uint> modes =
    new Dictionary<Renci.SshNet.Common.TerminalModes, uint>();
modes.Add(Renci.SshNet.Common.TerminalModes.ECHO, 53);

ShellStream shellStream =
    client.CreateShellStream("xterm", 80, 24, 800, 600, 1024, modes);

// Wait for the shell prompt, send the sudo command, then answer the password prompt.
var output = shellStream.Expect(new Regex(@"[$>]"));

shellStream.WriteLine("sudo sh /home/my_user_name/scripts/install_and_configure.sh");
output = shellStream.Expect(new Regex(@"([$#>:])"));
shellStream.WriteLine(password);
client.Disconnect();

That is pretty much it.  I hope that this helps someone! Cheers

 

Eating Crow


Last year we started to work on a project that was pretty high profile within the business.  The dates had been adjusted several times as we got more information, which added more and more levels of stress.  In parallel, we had been re-architecting our platform, so there were opportunities to change how we process business transactions.  Based on these factors we had to make several design decisions.

We decided that we wanted to start making our business transactions more asynchronous.  One of the easy ways to do this is with message queueing.  Leading up to this, for another initiative, I had been investigating the various queueing tools available, both open source and commercial, and my conclusion was that Kafka was the one we should use.

Given the tight schedule, we decided that we would go with it for now, foregoing any prototyping.  I had ulterior motives for choosing Kafka.  We currently shuttle all of our click data to Adobe SiteCatalyst, and many of us feel we should keep that data internal.  Kafka was designed at LinkedIn to handle the ridiculous amount of data that they were producing.  I believed that by choosing Kafka as our message queue we would be set up to bring our data in-house.

When we started to integrate the queue into our processes, the prototyping that we had skipped floated to the surface.  I took on the task of creating the infrastructure so that the client application could leverage it.  I learned that there were a lot of things that weren't as straightforward as I thought.

Some of the challenges were with the available tooling for interacting with the queue.  I tried a few of the libraries that were out there, but for one reason or another I didn't think that they would meet our needs.  I decided to use the REST proxy from Confluent.  This part of the integration was promising; everyone knows REST!  There was one problem…this REST proxy is stateful.

I don't want to detail all of the challenges that we encountered, and continue to encounter, in this post; maybe another day.  Skipping ahead, we have turned off the queue for now until we can build integration tests to make sure that Kafka, or any queue, will fit our needs.  It may turn out that we need to choose a different technology like RabbitMQ or ZeroMQ, but I will eat that crow when/if the time comes.  I still believe that Kafka is a great tool and can meet our needs short and long term.  For now, I need to iron out all of the wrinkles so we can get it back into production.  I wonder what crow tastes like if you brine it first?

 

 

 

 

Call for papers, yeah right!

Lucky for me, a co-worker turned me on to Audible.com.  For the last two weeks I have listened to one lecture series and two books (well, 1 1/2, I round up).  One of the books that I am reading right now is John Sonmez's Soft Skills: The Software Developer's Life Manual.

One of the topics that he reiterates, a bunch of times, is marketing your brand.  One of the methods that he is adamant about is blogging.  He remarks that a lot of engineers either do not blog at all or do not blog regularly.  The feedback that he gets from engineers is that they don't have anything to say or they don't have time.  I have both!

The question I need to ask myself is: why don't I do it regularly?  I guess there are a bunch of reasons I could proclaim, but it is unlikely that they are worth a nickel.  One of the things he said that makes a lot of sense to me is to take note of the ideas that I have, the things that interest me, and the trends that I see.  That is the reason for the title "Call For Papers."  This is my call to come up with ideas that are blog worthy!

Here are some topics that I think are worthwhile:

  • Docker
  • Kafka
  • Zookeeper
  • DDD
  • Architecture
  • Home Automation

These are all relevant in my work today.  Let’s see how this goes.

 

The Unseen Challenges With The Speed Of Innovation

 

Listening to the keynote presentation at SACon this morning, I heard Rebecca Parsons from Thoughtworks make an interesting point.  She spoke on the "evolution of evolutionary architecture" and compared the technology roadmap of today to the one of decades past.

In her talk, she said that in the past roadmaps could be projected out years, and in some cases a decade, whereas now, as she says, it may be as little as 10 months.  As an architect, I think that is a very important point.

There are many factors outside of technology that are impacted by that acceleration.  My last role was as a development manager, and I faced the technology challenge daily.  When recruiting younger engineers you have to present a company that is fun, cutting edge, and compensates well.  Compensation and fun are relatively easy to achieve, but technology is a different story.

As an architect, it is my responsibility to help guide the enterprise toward responsible options while still using the best tool for the job.  The responsible part is where the real challenge lies.

The speed at which new tools and frameworks are being rolled out is astonishing.  How can you be on the latest and greatest and still provide a safe path for the company to move forward?  Looking back at the recruiting challenge, how can you remain sexy if you are not on the bleeding edge?  I don't know the answer, and I am not sure that there is one.