I have found during my career that I am pretty good at problem solving when it comes to software development. I am only now starting to realize that if I had to go back and clearly articulate my choices, it might not go very well. In my position as an Architect, I can no longer rely on problem solving alone to be successful. In a recent post I also talked about my inefficient communication.
In two weeks, some of the senior technical leadership from other companies in our corporate family are coming to our facility. I want to present where we are on our platform re-architecture project. I will have to clearly communicate the problem statements and prove that my decisions were sound. Needless to say, I am excited and nervous at the same time. I love talking and presenting on technology, but I feel that this one will be different.
Many moons have passed since I analyzed the problem space, so regrouping is going to be helpful. I try to approach problems through the eyes of the “ilities” or quality attributes. Approaching the problem through those lenses, for me, serves as a checklist to make sure that I am designing a system that will meet a minimum level of quality.
As an online retailer, we have the normal issues of scalability and uptime, so that is always a good place to start. As I casually jotted down some of the issues that we have today, I started to see what the goals of the new platform should include. This is the original list that I started with:
What are our goals?
- Low Risk Deployment
- High Frequency Deployment
- Low Response Time at all tiers
- Easily Scalable
- Lower amount of bandwidth usage
- Decrease Mean Time To Repair
- Increase Supportability
- Increase developer efficiency
- Zero Sales Lost to system failure
- Increase security of system so as to comply with PCI and PII best practices
This seems like the normal wish list that you would expect. Some items are more easily achievable than others. The list was too long for me to digest, so I narrowed it down to a more reasonable size.
- Improve Deployment Velocity
- Improve Performance
- Reduce Mean Time To Repair (MTTR)
- Reduce Sales Loss Due To Outage To Zero
- Capable of Appropriate Security
With this shorter list, it is much easier to reason about what type of system I need to design. The items in the original list are not unimportant, but they can be achieved by solving for a larger issue. Here are some of the ideas that make sense to me.
Improve Deployment Velocity
There are some obvious aids to support this goal, such as automated testing and automated deployment. One of the patterns that I am applying is isolating components based on their single responsibility and their rate of change. One of the issues that holds us back from releasing more often is the support cases that have to be solved in the current development cycle due to problems created in the previous one.
If you can isolate a component based on its rate of change, then you can reduce the amount of testing, manual or otherwise, that is required to get the code out the door. So I think reducing our risk and testing burden helps to achieve this goal. If you add blue-green style deployment, the team does not have a large deployment that accumulates risk the longer it sits on the shelf. We have used this style of deployment for many years now and it has helped us somewhat, but we have not matured other areas of our process enough to maximize the benefits that it can offer.
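To make the blue-green idea concrete, here is a minimal sketch of the cutover logic. It is only an illustration, not our actual deployment tooling: the `Environment` and `BlueGreenRouter` names and the `health_check` callable are hypothetical, and a real setup would switch traffic at a load balancer or DNS layer rather than in a Python object.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Environment:
    name: str
    version: str
    # Hypothetical probe; returns True when the environment is healthy.
    health_check: Callable[[], bool]


class BlueGreenRouter:
    """Tracks which of two identical environments receives live traffic."""

    def __init__(self, blue: Environment, green: Environment, live: str = "blue"):
        if live not in ("blue", "green"):
            raise ValueError("live must be 'blue' or 'green'")
        self.environments: Dict[str, Environment] = {"blue": blue, "green": green}
        self.live = live

    @property
    def idle(self) -> str:
        """Name of the environment that is not serving traffic."""
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> bool:
        """Release to the idle side and cut over only if it passes its health check."""
        target = self.environments[self.idle]
        previous_version = target.version
        target.version = version
        if not target.health_check():
            # Live traffic never saw the bad build, so just restore the idle side.
            target.version = previous_version
            return False
        self.live = self.idle
        return True

    def rollback(self) -> None:
        """Point traffic back at the other environment, which still runs the old version."""
        self.live = self.idle
```

The point of the sketch is that a failed release never touches the live side, and a rollback is just a pointer swap. That is where the risk reduction comes from.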
Improve Performance
This is a loaded statement, and I know that we have all had to listen to our stakeholders explain how slow their experience was with something we built. There are so many options available to developers today to start chipping away at the problem. Fewer hops, smaller transport payloads and caching are some of the top ones that come to mind. However, none of these options can be achieved without some level of compromise. Whether it comes in the form of additional complexity, hardware availability or code readability, it will always be a give-and-take tradeoff.
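As a small illustration of the caching option and the tradeoff that comes with it, here is a minimal time-to-live cache sketch. The `TTLCache` class and the `loader` callable are hypothetical names for this example, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Caches loaded values for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # Maps key -> (expiry time, cached value).
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: Hashable, loader: Callable[[Hashable], Any]) -> Any:
        """Return the cached value, or call loader(key) when it is missing or expired."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = loader(key)  # the slow call we want to avoid, e.g. a remote hop
        self._entries[key] = (now + self.ttl_seconds, value)
        return value
```

The compromise shows up immediately: every hit saves a hop, but for up to `ttl_seconds` the caller may be reading stale data, and someone now has to own the decision of how stale is acceptable.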
Reduce Mean Time To Repair (MTTR)
Achieving this goal can be a real challenge because of the non-deterministic nature of production failures. Another difficult problem is ensuring that the alerting system is correctly configured to give you actionable alerts. We use so many systems to monitor our production environment, each of them creating their own alerts. To complicate things, they are owned by different groups, so summarizing an issue can be difficult when you have to string together events from several sources.
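One simple way to picture stringing alerts together is to group alerts from every monitoring source into candidate incidents by time proximity. The sketch below is only an assumption about how that could look: the `Alert` fields, the `group_alerts` function and the 120-second window are all hypothetical, and real correlation would likely also use host, service or trace identifiers.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Alert:
    source: str       # which monitoring system raised it (hypothetical field)
    timestamp: float  # seconds since some common epoch
    message: str


def group_alerts(alerts: Iterable[Alert], window_seconds: float = 120.0) -> List[List[Alert]]:
    """Group alerts into incidents when each follows the previous one within the window."""
    incidents: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        last_incident = incidents[-1] if incidents else None
        if last_incident and alert.timestamp - last_incident[-1].timestamp <= window_seconds:
            last_incident.append(alert)
        else:
            incidents.append([alert])
    return incidents
```

Even something this crude turns several alerts from different tools into one incident to look at, which is the summary that is hard to get when each group owns its own alerting.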
Reduce Sales Loss Due To Outage To Zero
This seems to be impossible, but I think we can get pretty close from a technology perspective. The use of durable messaging techniques such as queueing will get us much closer to a solution, but it is not the silver bullet that we would like it to be. This is where disaster recovery and server clusters come into play along with patterns such as circuit breaker. This goal will require constant tuning and monitoring to achieve.
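For readers who have not used the circuit breaker pattern, here is a minimal sketch of it. The class name, thresholds and injectable `clock` are assumptions made for illustration. A production system would more likely use an existing resilience library than hand-rolled code like this.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        """Invoke func unless the circuit is open, tracking failures along the way."""
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

When the breaker raises `CircuitOpenError`, the caller can fall back to something cheaper, such as placing the order on a durable queue for later processing, instead of making the customer wait on a dependency that is already known to be down.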
Capable of Appropriate Security
Security for an eCommerce site is a very touchy subject lately, and the solution pendulum can swing very far in either direction. One of my friends in IT Ops joked that, if he could, he would create a VLAN for each server, which belongs on the “too much” list. On the other side, we have to bend to the needs of the business, and at times all we can do is inform them of the risk that they are taking by constraining a solution. To top it off, there isn’t a huge pool of security experts to hire, and it would be hard to impress upon management the need to hire one.
Luckily, I have two weeks to review my earlier artifacts and make any course corrections that may be needed. So I guess “Have Visio, will travel”. I will start to outline the designs in more detail in future posts.
FYI, my writing style is not where I would like it to be, so please bear with me. It is not for lack of trying.