DataDog asked OpenTelemetry contributor to kill pull request (github.com/open-telemetry)
349 points by raybb on Jan 26, 2023 | 95 comments


From @mx-psi

> Speaking as a Collector contrib maintainer, I just wanted to say that I am not going to continue reviewing this PR or start reviewing other future PRs related to the Datadog APM receiver to avoid any semblance of conflict of interest given my role both as a maintainer on this repository and as a member of Datadog's OpenTelemetry team.

And from their GitHub profile:

> Open Source Software Engineer at Datadog , focusing on OpenTelemetry

What's the conflict of interest exactly? You work at Datadog, supposedly to work on OSS with a focus on OpenTelemetry, and you don't want to review Datadog-related code for OpenTelemetry? Sounds weird; that kind of profile is exactly the type of person who should be reviewing the code, since they have knowledge of both sides of it.

Rather, it sounds like Datadog is walking this back and doesn't want to support OpenTelemetry if it means OpenTelemetry will support their own tooling, rather than just everyone else's.


The potential conflict is that people might think/perceive DD to still be out to harm this project given their previous request for the author to stop development. Anything that might be regular code review or discussion for "why did you choose this approach" could be seen as sabotage on this super public issue.



Not sure I entirely follow what's going on here. Is there some context behind this that's useful to know?


If I understand correctly, this is a receiver — which means you can take your existing DD instrumentation in their format and have it translated to OTel using this code.

It lowers the switching cost to get off of DD.


And Datadog competitor Grafana sponsored it: https://twitter.com/boostchicken/status/1618692475845238784


Protect the moat at all costs!


It's worse than that, they want security through obscurity too. They feel like someone is inappropriately tinkering with the agent they want customers to install. It's open source: https://github.com/DataDog/datadog-agent ...but the downloads are behind a login: https://github.com/DataDog/datadog-agent#datadog-agent


Not really - the script it gives you in-app is

  DD_API_KEY= DD_SITE="datadoghq.com" bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"
They tell you to sign in because installing without a key leads to non-working agents and support tickets.

As for the libraries themselves, they're all on the regular package manager for that language, e.g. PyPI.


Ah, that's good to know, thanks. I didn't have that link, but I can see it doesn't have an API key, so they can't change the download based on who's downloading it. That's good for transparency. Ditto the package managers. I see some Linux package manager public keys in there too.

Edit: I missed the environment variable before "curl". The .sh is downloaded without the API key but the rest could be done using the API key, since it is passed to the script.


They want to keep their agent secret.

Secret Agent Man...


They've given you a pull number and taken 'way your name


OpenTelemetry basically allows for vendor-neutral instrumentation for your metrics, logs, and traces. It makes it easy to switch to a cheaper or better service with minimal lock-in, which is obviously bad for DataDog's and other companies' business models.


> It makes it easy to switch to a cheaper or better service with minimal lock-in

Got any examples?

I tried running my own "stack" for a project I wanted alerting on. I landed on the Jaeger all-in-one Docker container in docker-compose with COLLECTOR_OTLP_ENABLED (I wasted time on Zipkin first; its UI just was nowhere near as good as it ought to be).
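For reference, a minimal sketch of that setup; the image tag and port list here are illustrative, not prescriptive:

```yaml
# docker-compose.yml -- single-container Jaeger with OTLP ingest enabled
services:
  jaeger:
    image: jaegertracing/all-in-one:1.41
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC ingest
      - "4318:4318"    # OTLP HTTP ingest
```

Point your apps' OTLP exporters at 4317/4318 and the traces show up in the UI on 16686.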


Add another exporter to the `otel-config.yaml` and, if they support it properly, that should literally be all it takes. That said, "if they support it properly" is exactly what OTel is working on. It's super stable for a pre-v1 release but is still missing some features and polish, imho, that prevent it from getting the official v1 tag.
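Roughly like this; the second exporter's name, endpoint, and header are placeholders, not any real vendor's settings:

```yaml
# otel-config.yaml -- fan the same traces out to two backends
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
  otlp/othervendor:                 # hypothetical second backend
    endpoint: ingest.example.com:4317
    headers:
      api-key: ${env:VENDOR_API_KEY}
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, otlp/othervendor]
```

The application-side instrumentation doesn't change at all; only the collector config does.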


If you're looking for something a bit more "few-clicks-and-you-are-up-and-running", check out OpsVerse ObserveNow: https://opsverse.io/observenow-observability/. Entirely powered by OSS tools, with ingestion-driven pricing, and without the hassle of managing the stack and scaling up.


Check out telemetryhub.com (admittedly I work here).

We offer a free trial and don't charge per seat.


I understand the need to charge money for things, but the $150 per month pretty much rules out any bootstrapped startups.


I wish I knew the right answer between "spin up a cloud VPS Linux server and run your own with Docker compose" and pay $150/mo for something like DataDog


I used the collector to proxy OTel traces to Jaeger and Zipkin for uniform Istio tracing.


> which is obviously bad for DataDog and other companies business model

Meh. The best way to keep somebody on your product is to make it easy for them to get off your product.


This comment highlights my problem with singular project-tracking issues. Once they get as massive as this one, people suddenly "can't follow along". This isn't specific to GitHub, but in this case the UI hampers readability and the comments are very much in-the-moment. That leaves me asking: (a) should we stop writing such mammoth comments in issues, or (b) should we leverage generative AI to summarize a weekly or monthly timeline of issue interaction? What is the point of having historical issues if they're too complex to follow along with? This was a question I regularly asked at GitHub that usually resulted in blank stares.

GitHub's own answer to this is to force engineers to use a /slash command to post a summary of the week's updates. Clunky, but it works.


But if you're actively working on a problem you're in it, and the comments probably make sense. Coming in from outside via web search or a shared link, yeah, it's probably hard to jump in without full context of the lead-up. Once the problem is fixed and you've moved on, do you really care that much after the fact?


How do you manage historical context?


Wondered the same. Assume this is a threat to their business? An open source competitor can do something cool so they want to kill the functionality?


We use DD, and we have to use Otel for our Elixir apps to ingest spans into DD because there are no DD SDKs for Elixir or Erlang.

From the perspective of a customer, I can tell you that DD already has quite a bit of a moat. Their main competitive advantage, and what got us into using it, is being able to correlate data across APM, custom metrics, and logging through the use of tagging. They then densely link data together across the platform. There are also built-in Jupyter-style notebooks. By correlating data like that, you get more value out of ingesting as much data into DD as you can. There are some additional services we're not using, such as auto-correlation with ML (and alerting for anomaly detection), and security monitoring that also looks across the entire platform using their ML tech.

Like AWS/GCP/Azure, it can get expensive quite fast using on-demand pricing, so there are negotiated annual contracts. Right now our team is small, and to replicate the functionality we use with self-hosted open-source tooling, we might as well hire another engineer just to set up and maintain such a platform.

I get that you want to defend the moat, and that eroding the little things can lead to eroding the big things. As I see it though, if you need those correlations, you'll need a certain scale and team size before it makes sense to build out something like that for yourself.


Their data correlation is awful compared to competitors like Honeycomb, Dynatrace, and Instana. What we want is something that cuts through all the noisy data and shows users what anomalies are occurring. We shouldn't be sifting through a bunch of outdated dashboards and notebooks in this day and age.


you may want to check out https://github.com/SigNoz/signoz

I am one of the maintainers. We are building a DataDog alternative with native support for opentelemetry.


I have seen it around, but didn't know it was open source. If you guys are working on correlation, then I will definitely be watching this. Consistent link urls that can be shared is also useful.

Hopefully someone else will contribute the notebooks feature. Those are very useful.

Something DD is not careful about is consistently using UTC for all time labels in all graphs (with maybe a quick way to convert to local time when we need to communicate with stakeholders).

(I don't know why your comment was downvoted).


We use Datadog at the medium sized unicorn I work at. People often don’t understand how important correlating timestamps is. I wish Datadog was just opinionated as hell and said everything you see is UTC by default. Or even less dramatic than that, literally just slap the time zone on whatever it is that you’re viewing. It’s so insanely critical when you’re viewing logs on a prod outage that people immediately see and understand that the logs they are looking at are PST, CST, UTC or whatever. It’s insane that software designed to help people unfuck production systems doesn’t display that by default.

As-is, we go through a song and dance whenever we look at logs and metrics: "oh, this happened at X time, which is Y time for most people."


Our ops team standardized on UTC (though not all of our tools support that).

When we talk to stakeholders and customer-facing folks, though, we tend to convert it to local time.


Yeah, we are actively working on correlation. We have metrics, traces and logs in a single app - so we should be able to provide a seamless correlation.

Thanks for the point about notebooks; we have not thought in detail about how people use them. Is it primarily to collaborate between team members when an incident happens, or even when there is no incident and you are analysing stuff?


We have used those notebooks for:

- Incidents, collecting different metrics and showing them next to each other, with comments

- Longer-term reliability debugging. They can form a kind of ad-hoc dashboard. These are usually issues that degrade performance, don't have immediate or widespread customer impact, and are things we are not immediately able to detect

- Related, performance tuning. Sometimes, the key metric is unknown. We want to explore it, and then make changes to infra, and then see if that moved the needle

- Sometimes, the ad-hoc widgets are useful enough to export to a dashboard

- I can take any widget anywhere else and import it into a notebook, or start a new notebook out of it.

The notebooks are similar to the dashboards, just that the layout engine only allows a linear notebook layout instead of a grid. There are already text widgets, though in notebooks the button to add one is easier to access. Other than the comments, it's basically a dashboard with the UI changed so that it feels like a notebook.

Keep in mind too, all dashboard and notebooks modify timestamps and other states in the browser URL, so it is easy for me to copy-paste those into Slack so that other people can see what I am seeing.


> negotiated annual contract

Maybe we are too small, but Datadog is one of the few vendors we haven't been able to negotiate down in years. The price has always been what's on the website. I honestly don't even mind; with some vendors it feels like you are at a bazaar and they always tell you that their final discounts had to get approval from the CEO.


> Maybe we are too small but Datadog is one of the few vendors which we haven't been able to negotiate down in years.

We spend a few thousand a month with Datadog and our account manager reaches out every quarter to adjust our monthly commit up/down which provides a 20% discount (I think) or so off from the website prices.


I always tell people looking at Datadog to remember "DDDD" or "DataDog Don't Discount".

Compared to most Enterprise vendors it is a lot harder to get a discount from Datadog. Most vendors will give you 1/3 off just for signing a contract and committing to a spend, Datadog is not like that.


Just show them New Relic's new pricing.


The New Relic that either had a breach they never notified me about or sold my email address? I've been getting random spam to newrelic@<domain> for years now. Nope, they'll never see any money from me if I can avoid it.


I don't know if we got a discount for an annual contract, or if the annual contract pricing was published on the website. But we were able to work out things like, using seasonal pricing because our traffic is seasonal.


Is Mr boostchicken a DD employee? If so then this makes sense. Otherwise, it doesn't.


Seems they are not (Meta is mentioned on their profile), but it doesn't make sense either way. Datadog should not be pressuring either their own employees or others over adding Datadog-relevant collectors to OpenTelemetry, especially when their public position is that they "love" and support efforts like OpenTelemetry.


Oh wow... I worked with Boostchicken at Sony. He's not at DataDog afaik.


I think if Borat was working at my former office, I would know about it.


Under Blake, so that side of the floor!


Thanks for the link. I didn't see it on first skim-through. It was in the hidden comments, and I had to go back and click a few before I found it.


Related, from Feb 2022, on what this is and why it's important: https://twitter.com/mipsytipsy/status/1494861690059759616


I'm always willing to dogpile on DataDog (no pun intended), given bad experiences with their sales, but all I'm reading out of this is that the DD person didn't want to review it out of a potential conflict of interest. It was reviewed however by someone else. Am I missing something here?


You're missing that this was a year-old PR ready to merge but it was stopped from merging by DataDog because it makes it easier for users to migrate from DataDog to another logging service


Where did DD stop the merge? I just saw them state that the reviewer didn't want to review it, as they work at Datadog. It could simply be that they'd be biased TO merge it even if it contains flaws. Having an independent reviewer ensures the code quality meets the bar.


The OP didn't link directly to the comment and it's buried in the thread: https://github.com/open-telemetry/opentelemetry-collector-co...


No, I am unsurprised by the DataDog taking that position; it fits with what I know of the company. However as someone pointed out, the comment in question wasn't originally linked, but I see it now.



Tentative TLDR: It seems like DD pressured a contributor into not shipping a feature that would have made DD just another telemetry vendor.

It seems like right now, data can flow IN datadog libraries/agents but not out. This PR would sort of allow data to flow OUT of datadog's libs/agents?

And DD doesn't want that because it removes their lock-in power?

Is this correct? This would be extremely crappy of Datadog.


None of the APM providers like New Relic, Data Dog, or Azure App Insights want a truly open ecosystem.

The only reason they support "Open" Telemetry is because they're worried about lock-in at the data sources.

For example, App Insights supports rich/structured telemetry via its proprietary SDK and various APIs. No open-source developer in their right mind would ever hard code such a proprietary dependency into something published under a truly open license.

Now that rich telemetry instead of simple text logging is starting to become an increasingly popular approach, the proprietary APM vendors got nervous that they would get "locked out" of the entire open source ecosystem, to be replaced by a data source that is open and not compatible with their proprietary sinks.

Hence Open Telemetry.

It was always about making the source open, not the sink.


I won't speculate on the top-level strategies of companies competing with another, so take this as a grain of salt.

From my perspective (maintainer, employed by a vendor), all of us who work for these different companies collaborate very well together. We all recognize that it's both technically tractable and fundamentally user-friendly to make instrumentation be a common standard that anyone can use to point at any of the OSS and commercial tools in this space. There's plenty to differentiate on with telemetry backends, querying experiences, API capabilities, data analysis tools and UX, etc. We have a long way to go to see this vision fully realized, but it's quite far along and I have no doubt we'll arrive at the right outcome here.


Everything you've just said is compatible with what I've said.

The vendors are concerned about being left out at the "source", and are seeking to differentiate with their proprietary "sink", usually closed-source SaaS solutions.

I'm not even arguing that this is bad, it's just how markets work, and it's currently beneficial to developers in general, including both open-source developers and the type working in a cubicle farm somewhere.


...sure? I guess I'm saying that I don't really care about this speculation, since the reality on the ground is that we all collaborate.


100% disagree. I've talked with many of those providers, and their C-suites are actually behind the open ecosystem. You forget many of them have huge market share to gain from an open ecosystem.


Open source these days is very often about making the thing you use to consume SaaS open so it can spread as widely as possible, but it’s never made easy to commoditize the SaaS part.


> This PR would sort of allow data to flow OUT of datadog's libs/agents?

Yes.

This allows you to expose an OpenTelemetry Collector receiver on the Datadog agent port, 8126[0], letting you collect Application Performance Monitoring (APM) traces from any APM-enabled Datadog library[1].

If I had to guess, DataDog's argument is that they don't want you using the engineering hours they invest into their libraries to have DD do the heavy lifting of collecting APM traces and send the messages off to another service.

<removed OSS comment>

0: https://github.com/boostchicken/opentelemetry-collector-cont...

1: https://docs.datadoghq.com/tracing/#send-traces-to-datadog
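If I'm reading the PR right, the Collector config would look something like this; the `datadog` receiver name and its fields are my guess from the linked branch, not a released API:

```yaml
receivers:
  datadog:
    endpoint: 0.0.0.0:8126   # the port DD tracing libraries already send APM traces to
exporters:
  otlp:
    endpoint: other-backend.example.com:4317
service:
  pipelines:
    traces:
      receivers: [datadog]
      exporters: [otlp]
```

In other words, existing apps keep their DD instrumentation untouched while the traces land wherever the exporter points.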


When we switched off of AppSignal, since our instrumentation was deeply intertwined with our code, it took a while to change it over to OTel. But at least now, being OTel, we can theoretically change to a different vendor if we want. If we were to move off of DD and there is an OTel collector that can accept DD APM spans, then we can switch now and refactor later.

In practice, DD has a lot going for it that I don't see in New Relic. There are also some key features in DD that are not in OTel; for example, we can't use DD's APM ingestion controls for controlling sampling rates for OTel spans, and DD has no incentive to add such a feature. I'm actually working on adding OTel sampling to our project right now. (In our case, we have to use OTel because DD does not have SDKs for Elixir.)


Why sample otel spans and miss out on the important ones?


You can sorta have your cake and eat it too.

Firstly, not all spans are interesting. When 99.99% of your traffic is just going to serve up an HTTP 200 within your acceptable latency threshold, you don't need every one of those. You probably do want to keep 100% of error spans, or those where the root has a duration beyond a configured threshold. There are tools to sample that way.

Secondly, there's ways to also attach your effective sample rate as metadata to spans, and if there's a backend that supports re-weighting counts based on that, you can still get accurate all-up counts of overall traffic.

Admittedly, OTel and many other backends don't have the best story for this yet. But it's getting better.
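For a concrete sense of "keep errors and slow traces, sample the rest", the Collector's `tail_sampling` processor can express that kind of policy; the thresholds here are made up for illustration:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans until the whole trace is seen
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 5000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
```

A trace is kept if any policy matches, so errors and slow requests always survive while healthy 200s get the 1% baseline.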


While I would like to ingest every one, cost is a factor.

Even if we were self-hosting, there's a cost to ingesting and storing every single span.

And even if we were able to pay for ingesting 100%, not everything is practical to ingest in full. Our most common request type (heartbeat) generates a span payload that is a multiple of the size of the original request. We're using Elixir in production, and those systems can absorb a tremendous amount of traffic, saturating the entire CPU capacity of the hardware if we let them. The agents are not capable of keeping up.


Because of cost?


The library you linked is just for sending metric data (gauges, counters, etc.); the relevant library for APM is called ddtrace, and it's dual-licensed under Apache-2.0 and BSD-3-Clause.

https://github.com/DataDog/dd-trace-py/blob/1.x/LICENSEe


> DataDog's libraries don't seem to be OSS: https://github.com/DataDog/datadogpy/blob/master/LICENSE

Is that not verbatim the 3-clause BSD license (which is an FSF/OSI-approved OSS license)?


That LICENSE of theirs is literally 3-clause BSD?


As far as I can tell, literally no vendor allows data to flow out? What am I missing?


More vendor lock-in shenanigans.

Datadog has always been a proprietary POS. I don't know why people use it. APM traces? How long before Grafana has these capabilities in OSS?

So annoying seeing a company like DD who cannot innovate at all, trying to lock in the average company.


Do you actually use it at scale in a large complicated system? IMHO it’s the opposite - datadog is consistently the most innovative and fastest iterating observability platform out there. Nothing comes close. It’s expensive yes, and there’s some lock-in yes, but they are GOOD.


As someone who has been in this space a long time, I think you are pretty off base here. The commercial vendors have absolutely been the ones who did all the innovation in tracing. The OSS equivalents were really poor imitations for the longest time. Zipkin/Jaeger/OpenTracing have been coming along slowly for many years, but it's really been maybe two years since that work became competitive with what was available in the commercial APM space. For the most part it's felt like that only happened because the commercial tools all halted new work on their proprietary tools and told their staff to push OTel over the finish line. None of them actually want to pay teams of engineers to maintain proprietary libraries for every language; the execs are absolutely drooling to find a way to pawn as much of that cost off onto others via OTel. Locking people in on agents is naive when the truly heavy lift in migrating platforms is swapping dashboards and alerts and retraining all your staff.

You see the same thing around integrations, everyone used to have to roll their own proprietary chunks of code that in the end were all querying mostly the same data points back from servers and API's. Now everyone just prefers to wait for the Prometheus exporter and they adopt that instead.


> How long before Grafana has these capabilities in OSS?

They've already started down that path with Grafana Tempo[0]. It's functional, but their UX and discoverability need a lot of work.

[0]: https://grafana.com/oss/tempo/


Datadog is stupidly expensive for what it provides. Don't recommend and I hope something open kills it soon.


I'm not so sure to be honest. It is very expensive. But being able to quickly get all of the features is fantastic for a small company. As you start to use more volume (and your bill starts growing) then it makes sense to start considering self-hosting. But for many teams with relatively small data volumes self-hosting is likely not worth the engineering time.


Out of curiosity, is there anything self-hosted that compares to data-dog RUM?


Likewise for New Relic. Unfortunately OTel and everything similar doesn't remotely come close to the usability and ease of configuration that New Relic does. Not yet, anyway...


Ideally you just implement OTel into your apps and later plug that into any metric/log/trace collector you want, so you have DataDog and Newrelic and other frontends competing based on their feature suite and price.

As it stands, this was/is trying to use DataDog's existing APM libraries to turn them into OTel for ingest into other providers.


Plus they have (or had, just going off of my personal experience) very unethical sales practices.


you may want to have a look at SigNoz - https://github.com/SigNoz/signoz

PS: I am one of the maintainers


stabs repeatedly




I love this bit:

> Hello, I'm using this receiver in production for about one year, and left some comments that may be helpful.

>> Are you serious? That is intense. Did it scale? Any memory issues? If I remember correctly I tuned all that away.

Always astonishing to see broken stuff doing well in prod lol.


“Not production ready” stopped no one.


I want to go for a drink with Moonheart (who used this unmerged, untested, year old PR in Production for a year) and find out how I, too, can become an absolute maverick.


Datadog's APM traces are the one thing that locks companies into Datadog. If this is merged, it makes it a lot easier to move from Datadog to something else.


The only possible lock-in I see is if you have annotated code with Datadog-specific tracing instructions, and I'm pretty sure that most of those features aren't relevant for OpenTelemetry.

Switching APM providers isn't hard at all. Maybe the Open Telemetry ecosystem lacks a good agent but my understanding is that their APM agent is pretty much already a copy of the Datadog agent.

I think New Relic would be a more interesting agent to use because their approach to APM isn't primarily to do sampled traces.



Datadog support is hot garbage, and for the amount of money I've sent them over the years they ignore feature requests and just DGAF about customers.

They're very hot on licensing though.


Nice tool. A couple years ago I wrote a similar tool that translates Datadog's agent's metrics/checks into AWS Cloudwatch equivalents.


Datadog is working on an OAuth API which may eventually enable some of this data to flow out of it https://www.datadoghq.com/blog/oauth/


How can DataDog threaten open source developers? Can they sue them? What is it?


A little lost here as well but my read was that the PR author was working with outside parties (their company??) to get approval (sign a CLA?) and some third party asked for Datadog's opinion and Datadog advised that they wait for a Datadog feature. Due to this discussion, the third party was in "default no" and the PR author did not push the matter, expecting Datadog to release something, when they did not after several months, the author reopened and pushed to get approval.

I could be totally wrong, but maybe combined we can make sense of it. From my read it feels like DD misguided this third party which then advised the author to spike.


We switched to OpenTelemetry agent from DataDog agent. Out of all commercial providers, Datadog is the worst one to show traces properly. Although it captures mostly the same traces, all of them have io.opentelemetry.something on UI making it very hard to read. I’ve tried Honeycomb, NewRelic, Lightstep.



