On the other hand, they've had plenty of time and resources to do just that in a...

malux85 · on March 4, 2020

Can you give me a concrete example of a massive distributed system that has zero downtime?

Because the largest distributed system I have seen and worked on was at Apple (or maybe DFP at Google) - and even though they had some of the smartest people in the world and literally billions of dollars behind them, there was still an endless list of problems and downtime events.

Spoiler alert: It doesn't exist.

luckylion · on March 4, 2020

The point isn't that "a system cannot fail", the point is "if the system fails, it's no big deal, shit happens, cut them some slack" is a weird way to look at it for corporate systems, especially in sensitive areas.

If you're running a HA system and you only need one nine to express your availability percentage, sure, sure, you have the smartest people etc and you're doing such a great job, and yeah, yeah, show me one system that has 100% uptime etc.

malux85 · on March 4, 2020

I didn't say it's no big deal, you're extrapolating and exaggerating my words because your argument is weak.

My point was that failure is inevitable in any complex system. I was responding to the parent's point, where he immediately pointed the finger at management in an accusatory way, and I was saying that's not constructive.

Also your point "They expect to be paid" is actually implicitly "I expect management to do their best to pay me" - there could be a failure in the payroll system, there could be a failure in the banks, there could be many reasons outside management's control that mean I'm not getting paid. I can say "why don't you have redundant payroll systems" (which is a stupid waste of resources given the cost/benefit/low failure rate). But my point is again - complex systems have failures - and SOMETIMES, JUST SOMETIMES, YOU CAN CUT THEM SOME SLACK.

0x8BADF00D · on March 4, 2020

When a fiduciary breaks their duty to their clients, you don’t cut them slack. You sue them. This isn’t like Silicon Valley where you can get away with antics like this.

malux85 · on March 4, 2020

You must be new here, welcome to late stage capitalism. Nobody rich goes to jail, and lawsuits are cost of business. You just factor them into the 5 billion dollar company, pay your 300M dollar fine and walk away a billionaire.

It sucks, I'm not defending it, but it's fact

unicornmama · on March 4, 2020

Google doesn’t target zero downtime. The marginal cost is too high. For important services (like Search page and ads) they aim for 5 nines uptime (99.999%), which translates to 5 minutes of downtime per year.

https://en.m.wikipedia.org/wiki/High_availability
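
For reference, the "5 minutes per year" figure is just arithmetic on the availability target. A minimal sketch, assuming a 365.25-day year; the function name is made up for illustration and is not taken from any Google tooling:

```python
# Yearly downtime allowed by an availability target of N nines.
# Assumption: a year is 365.25 days; a 365-day year changes the result only slightly.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Return the minutes of downtime per year permitted by `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return unavailability * MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in range(1, 6):
        print(f"{n} nine(s): {downtime_minutes_per_year(n):.2f} minutes/year")
```

Five nines comes out a little over 5 minutes per year, while a single nine allows more than 36 days.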

C1sc0cat · on March 4, 2020

As an ex telco guy all I can say is "amateurs" :-)

czbond · on March 4, 2020

And not a mention of Erlang at all? ;)

C1sc0cat · on March 4, 2020

I wasn't in Traffic

malux85 · on March 4, 2020

I know, I worked there for 3 years on million node clusters

unicornmama · on March 4, 2020

Then certainly you understand the importance of SLOs, how SLAs regulate reliability and feature velocity.

Let’s say I’m RobinHood. Let’s pick an SLO. I think three nines monthly SLO is a good start, that budgets ~45 minutes of down time per month. Maybe I can argue for a more aggressive SLO, but let’s pick this one - because I think it will keep users relatively happy as trades aren’t blocked for more than an hour at worst. I drive an agreement with stakeholders that if we needle out of this SLO, we drop all feature work and focus on hardening reliability.

RobinHood was out for a whole day. This is unacceptable. It points to a complete organizational fuck up - product and feature development have too much power and priority at the expense of reliability.

I’m not sure that RobinHood has ever heard of SLOs or reliability engineering. I really hope their leadership is smart enough to hire and empower the right people that will drive organizational change.
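
To put numbers on the SLO argument above, here is a rough sketch of the error-budget arithmetic. It assumes a 30-day window and treats the outage as a full 24 hours; the function names are hypothetical, not from any particular SRE toolkit:

```python
# Error-budget arithmetic for an availability SLO.
# Assumptions: 30-day window, outage counted as a full 24 hours.


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime the SLO allows within the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_consumed(outage_minutes: float, slo: float, window_days: int = 30) -> float:
    """How many times over the outage spends the window's error budget."""
    return outage_minutes / error_budget_minutes(slo, window_days)


if __name__ == "__main__":
    slo = 0.999  # three nines
    budget = error_budget_minutes(slo)
    ratio = budget_consumed(24 * 60, slo)
    print(f"Monthly budget: {budget:.1f} minutes")
    print(f"A one-day outage uses {ratio:.1f}x that budget")
```

Under those assumptions a three nines monthly SLO allows roughly 43 minutes, in line with the ~45 minutes quoted, and a day-long outage burns more than 30 months' worth of budget at once.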

malux85 · on March 4, 2020

Why would they burden themselves and their feature velocity with SLOs/SLAs when they can build a 5 billion dollar company insanely quickly even though they have downtime?

The users are not saying "We measured your 5 9's and I'm going to quit if you have 6 minutes more downtime"

Sure they lose some users who get annoyed, but they have a 5.6 billion dollar company, some users will go, a lot more are coming

unicornmama · on March 4, 2020

Users are saying “you were down for an entire day and I lost money - I’m out”.

Your reliability target is a product decision. Maybe with the right features the market will tolerate a shitty, unreliable financial service that falls over for an entire day. Or maybe RobinHood will go from a 5.6 billion dollar company to a zero dollar company because users hate them.

The point is that high reliability is a choice based on priorities, and it seems RobinHood does not care about it. And I will certainly stay the fuck away from their platform.

ethbro · on March 4, 2020

> some users will go, a lot more are coming

This works in the acquisition phase, which I suspect Robinhood is nearing the end of.

Once their userbase turns into the retention or conversion (competitors have $0 trades now, too) phases, mistakes like this are much more costly in the long term.

techie128 · on March 5, 2020

You're missing the point. Reliability and Performance are features in financial markets. They are key features that brokerages constantly advertise to differentiate themselves. These companies lay undersea cables to shave a few milliseconds off latency and pay a very hefty premium to be colocated in the same DC/rack as the stock exchange. Therefore Performance and Reliability are inseparable.

Nobody is debating whether people will continue using RH and that was never the issue. RH has massively damaged its reputation and reputation _is_ everything.

C1sc0cat · on March 4, 2020

Dialcom (Telecom Gold) in the UK was pretty close to 100%. Almost survived the big storm of 87 - unfortunately the modems were on the UPS.

We built an entire new DC and had Tottenham Court Road dug up in case the Thames flooded.

In fact any big telecom will have down times for a switch (central office) measured in generations

SirLJ · on March 4, 2020

Those Silicon Valley kids can't understand reliability... Good thing my bank and my broker are not run like this... what a joke...

techie128 · on March 5, 2020

You sir have a very warped view of SV. Do not stereotype us.

SirLJ · on March 9, 2020

First hand experience... move fast and break things, forever in beta, etc. does not always work, especially when you have proper SLAs

frockington1 · on March 4, 2020

Can you name a reputable brokerage that was down all of Monday and Tuesday this week?

Spoiler alert: it doesn't exist

plusplusc · on March 4, 2020

Interactive Brokers.

Well known in the financial community, but nearly unknown outside of it.

SirLJ · on March 4, 2020

Not true: Interactive Brokers were up as always and the API was working without a flaw...