
On the other hand, they've had plenty of time and resources to do just that in a reliable fashion, it's not like it's one guy in his bedroom (I hope!). It's not like they are volunteers doing this open source for the community, they are getting paid (very well, I assume) to run the system. And Management is getting paid (even better, I assume) to make sure the priorities are right and correct decisions are taken. "Who could've known there might be a lot more traffic" sounds like somebody failed in Management, and engineering might have failed by not foreseeing the issue and/or informing Management.

Sure, don't burn people at the stake, but "hey, it's hard, don't blame them, they are doing their best" doesn't cut it for me. I'm sure they're expecting to be paid and not for someone to "do their best" to pay them.



Can you give me a concrete example of a massive distributed system that has zero downtime?

Because the largest distributed system I have seen and worked on was at Apple (or maybe DFP at Google) - and even though they had some of the smartest people in the world and literally billions of dollars behind them, there was still an endless list of problems and downtime events.

Spoiler alert: It doesn't exist.


The point isn't that "a system cannot fail", the point is "if the system fails, it's no big deal, shit happens, cut them some slack" is a weird way to look at it for corporate systems, especially in sensitive areas.

If you're running a HA system and you only need one nine to express your availability percentage, sure, sure, you have the smartest people etc and you're doing such a great job, and yeah, yeah, show me one system that has 100% uptime etc.


I didn't say it's no big deal, you're extrapolating and exaggerating my words because your argument is weak.

My point was that failure is inevitable in any complex system. I was responding to the parent's comment, which immediately pointed the finger at management in an accusatory way, and I was saying that's not constructive.

Also your point "They expect to be paid" is actually implicitly "I expect management to do their best to pay me" - there could be a failure in the payroll system, there could be a failure in the banks, there could be many reasons outside management's control that mean I'm not getting paid. I can say "why don't you have redundant payroll systems" (which is a stupid waste of resources given the cost/benefit/low failure rate). But my point is again - complex systems have failures - and SOMETIMES, JUST SOMETIMES, YOU CAN CUT THEM SOME SLACK.


When a fiduciary breaks their duty to their clients, you don’t cut them slack. You sue them. This isn’t like Silicon Valley where you can get away with antics like this.


You must be new here, welcome to late stage capitalism. Nobody rich goes to jail, and lawsuits are a cost of doing business. You just factor them into the 5 billion dollar company, pay your 300M dollar fine and walk away a billionaire.

It sucks, I'm not defending it, but it's a fact.


Google doesn’t target zero downtime. The marginal cost is too high. For important services (like Search page and ads) they aim for 5 nines uptime (99.999%), which translates to 5 minutes of downtime per year.

https://en.m.wikipedia.org/wiki/High_availability
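
For anyone who wants to check the "nines" arithmetic, here is a minimal sketch. It assumes a 365.25-day year, and the function name is made up for illustration, not taken from any library.

```python
# Downtime allowed per year for a given number of nines.
# Assumption: a year is 365.25 days; the helper name is hypothetical.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(nines: int, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget, in minutes, for `nines` nines of availability."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return period_minutes * unavailability


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):,.2f} minutes per year")
```

Five nines comes out at a little over 5 minutes a year, which matches the figure above.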


As an ex telco guy all I can say is "amateurs" :-)


And not a mention of Erlang at all? ;)


I wasn't in Traffic


I know, I worked there for 3 years on million node clusters


Then certainly you understand the importance of SLOs, how SLAs regulate reliability and feature velocity.

Let’s say I’m RobinHood. Let’s pick an SLO. I think a three nines monthly SLO is a good start; that budgets ~45 minutes of downtime per month. Maybe I can argue for a more aggressive SLO, but let’s pick this one - because I think it will keep users relatively happy as trades aren’t blocked for more than an hour at worst. I drive an agreement with stakeholders that if we fall out of this SLO, we drop all feature work and focus on hardening reliability.

RobinHood was out for a whole day. This is unacceptable. It points to a complete organizational fuck up - product and feature development have too much power and priority at the expense of reliability.

I’m not sure that RobinHood has ever heard of SLOs or reliability engineering. I really hope their leadership is smart enough to hire and empower the right people that will drive organizational change.
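
To put rough numbers on the error budget idea above, a small sketch. The assumptions are mine: a 30-day month, a "whole day" outage counted as 24 hours, and helper names invented for illustration.

```python
# Monthly error budget for an availability SLO, and how much an outage eats.
# Assumptions: 30-day month; outage length of 24 hours is illustrative only.

def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of downtime allowed in a `days`-long window at availability `slo`."""
    return days * 24 * 60 * (1 - slo)


def budget_remaining(slo: float, outage_minutes: float, days: int = 30) -> float:
    """Budget left after an outage; a negative value means the SLO is blown."""
    return error_budget_minutes(slo, days) - outage_minutes


if __name__ == "__main__":
    slo = 0.999                      # three nines
    outage = 24 * 60                 # assumed: a full 24-hour outage
    budget = error_budget_minutes(slo)
    print(f"monthly budget: {budget:.1f} min")
    print(f"remaining after outage: {budget_remaining(slo, outage):.1f} min")
    print(f"outage is {outage / budget:.0f}x the monthly budget")
```

With a 30-day month the budget is 43.2 minutes, so a day-long outage burns through many months of budget at once. Under the agreement described above, that would mean freezing feature work.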


Why would they burden themselves and their feature velocity with SLOs/SLAs when they can build a 5 billion dollar company insanely quickly even though they have downtime?

The users are not saying "We measured your 5 9's and I'm going to quit if you have 6 minutes more downtime"

Sure they lose some users who get annoyed, but they have a 5.6 billion dollar company, some users will go, a lot more are coming


Users are saying “you were down for an entire day and I lost money - I’m out”.

Your reliability target is a product decision. Maybe with the right features the market will tolerate a shitty unreliable financial service that falls over for an entire day. Or maybe RobinHood will go from a 5.6 billion dollar company to a zero dollar company because users hate them.

The point is that high reliability is a choice based on priorities, and it seems RobinHood does not care about it. I will certainly stay the fuck away from their platform.


> some users will go, a lot more are coming

This works in the acquisition phase, which I suspect Robinhood is nearing the end of.

Once their userbase turns into the retention or conversion (competitors have $0 trades now, too) phases, mistakes like this are much more costly in the long term.


You're missing the point. Reliability and Performance are features in Financial markets. They are key features for brokerages, which constantly advertise them to differentiate themselves. These companies lay undersea cables to shave a few milliseconds off latency and pay a very hefty premium to be colocated in the same DC/rack as the stock exchange. Therefore Performance and Reliability are inseparable.

Nobody is debating whether people will continue using RH and that was never the issue. RH has massively damaged its reputation and reputation _is_ everything.


Dialcom (Telecom Gold) in the UK was pretty close to 100%. It almost survived the big storm of '87 - unfortunately the modems were on the UPS.

We built an entire new DC and had Tottenham Court Road dug up in case the Thames flooded.

In fact any big telecom will have down times for a switch (central office) measured in generations


Those Silicon Valley kids can't understand reliability... Good thing my bank and my broker are not run like this... what a joke...


You, sir, have a very warped view of SV. Do not stereotype us.


First hand experience... move fast and break things, forever in beta, etc. does not always work, especially when you have proper SLAs


Can you name a reputable brokerage that was down all of Monday and Tuesday this week?

Spoiler alert: it doesn't exist


Interactive Brokers.

Well known in the financial community, but nearly unknown outside of it.


Not true: Interactive Brokers were up as always and the API was working without a flaw...



