> Speaking as a Collector contrib maintainer, I just wanted to say that I am not going to continue reviewing this PR or start reviewing other future PRs related to the Datadog APM receiver to avoid any semblance of conflict of interest given my role both as a maintainer on this repository and as a member of Datadog's OpenTelemetry team.
And from their GitHub profile:
> Open Source Software Engineer at Datadog , focusing on OpenTelemetry
What's the conflict of interest exactly? You work at Datadog, supposedly to work on OSS with a focus on OpenTelemetry, and you don't want to review Datadog-related code for OpenTelemetry? Sounds weird; that kind of profile is exactly the type of person who should be reviewing the code, since they have knowledge of both sides of it.
Rather, it sounds like Datadog is walking back its commitment and doesn't want to support OpenTelemetry if it means OpenTelemetry will support Datadog's own tooling, instead of just others'.
The potential conflict is that people might think/perceive DD to still be out to harm this project given their previous request for the author to stop development. Anything that might be regular code review or discussion for "why did you choose this approach" could be seen as sabotage on this super public issue.
If I understand correctly, this is a receiver — which means you can take your existing DD instrumentation in their format and have it translated to OTel using this code.
Ah, that's good to know, thanks. I didn't have that link, but I can see it doesn't have an API key, so they can't change the download based on who's downloading it. That's good for transparency. Ditto the package managers. I see some Linux package manager public keys in there too.
Edit: I missed the environment variable before "curl". The .sh is downloaded without the API key but the rest could be done using the API key, since it is passed to the script.
OpenTelemetry basically allows for vendor-neutral instrumentation for your metrics, logs, and traces. It makes it easy to switch to a cheaper or better service with minimal lock-in, which is obviously bad for Datadog's and other companies' business models.
> It makes it easy to switch to a cheaper or better service with minimal lock-in
Got any examples?
I tried running my own "stack" for a project I wanted alerting on. I landed on the Jaeger all-in-one Docker container in docker-compose with COLLECTOR_OTLP_ENABLED (I wasted time on Zipkin first; the UI just was nowhere near as good as it ought to be).
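For anyone wanting to reproduce that setup, here is a minimal docker-compose sketch. The service name and host port mappings are my assumptions; the image name and the COLLECTOR_OTLP_ENABLED variable are Jaeger's documented defaults:

```yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true   # enable the built-in OTLP receiver
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP over gRPC
      - "4318:4318"    # OTLP over HTTP
```

Point your SDK's OTLP exporter at localhost:4317 (gRPC) or localhost:4318 (HTTP), then browse traces at http://localhost:16686.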
Add another exporter to the `otel-config.yaml` and, if they support it properly, that should literally be all it takes. That said, "if they support it properly" is exactly what OTel is working on. It's super stable for a <v1 release but is still missing some features and polish, imho, that prevent it from getting the official v1 tag.
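To make the "add another exporter" point concrete, a hedged sketch of what that change to `otel-config.yaml` might look like. The `otlp/backup` name and its endpoint are hypothetical placeholders for a second vendor:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  # Hypothetical second backend: declare it here, add it to the
  # pipeline below, and the same telemetry fans out to both.
  otlp/backup:
    endpoint: ingest.example-vendor.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger, otlp/backup]
```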
If you're looking for something a bit more "few-clicks-and-you-are-up-and-running", check out OpsVerse ObserveNow: https://opsverse.io/observenow-observability/ . It's entirely powered by OSS tools, with ingestion-driven pricing, and without the hassle of managing the stack and scaling it up.
I wish I knew the right answer between "spin up a cloud VPS Linux server and run your own with Docker Compose" and paying $150/mo for something like DataDog.
This comment highlights my problem with singular project tracking issues. Once they get as massive as this one, people suddenly "can't follow along". This isn't specific to GitHub, but in this case the UI hampers readability and the comments are very much in-the-moment. That leaves me asking: (a) should we stop writing such mammoth comments in issues, or (b) should we leverage generative AI to summarize a weekly or monthly timeline of issue interaction? What is the point of having historical issues if they're too complex to follow along with? This was a question I regularly asked at GitHub that usually resulted in blank stares.
GitHub's own answer to this is to force engineers to use a /slash command to post a summary of the week's updates. Clunky, but it works.
But if you're actively working on a problem, you're in it, and the comments probably make sense. Coming in from outside via web search or a shared link, yeah, it's probably hard to jump in without full context of the lead-up. And once the problem is fixed and you've moved on, do you really care that much after the fact?
We use DD, and we have to use Otel for our Elixir apps to ingest spans into DD because there are no DD SDKs for Elixir or Erlang.
From the perspective of a customer, I can tell you that DD already has quite a bit of a moat. Their main competitive advantage, and what got us into using it, is being able to correlate data across APM, custom metrics, and logging through the use of tagging. They then densely link data together across the platform. There are also built-in Jupyter-style notebooks. By correlating data like that, you get more value out of ingesting as much data into DD as you can. There are some additional services we're not using, such as auto-correlation with ML (and alerting for anomaly detection), and security monitoring that also looks across the entire platform using their ML tech.
Like AWS/GCP/Azure, it can get expensive quite fast using on-demand pricing, so there are negotiated annual contracts. Right now, our team is small, and to replicate the functionality we do use with self-hosted open-source tooling, we might as well hire another engineer just to set up and maintain such a platform.
I get that you want to defend the moat, and that eroding the little things can lead to eroding the big things. As I see it, though, if you need those correlations, you'll need a certain scale and team size before it makes sense to build out something like that for yourself.
Their data correlation is awful compared to competitors like Honeycomb, Dynatrace, and Instana. What we want to see is something that cuts through all the noisy data and shows users what anomalies are occurring. We shouldn't be sifting through a bunch of outdated dashboards and notebooks in this day and age.
I have seen it around, but didn't know it was open source. If you guys are working on correlation, then I will definitely be watching this. Consistent link urls that can be shared is also useful.
Hopefully someone else will contribute the notebooks feature. Those are very useful.
Something that DD is not careful about is consistently using UTC for all time labels in all graphs (and maybe a quick way to convert to local time if we need to communicate with stakeholders).
We use Datadog at the medium sized unicorn I work at. People often don’t understand how important correlating timestamps is. I wish Datadog was just opinionated as hell and said everything you see is UTC by default. Or even less dramatic than that, literally just slap the time zone on whatever it is that you’re viewing. It’s so insanely critical when you’re viewing logs on a prod outage that people immediately see and understand that the logs they are looking at are PST, CST, UTC or whatever. It’s insane that software designed to help people unfuck production systems doesn’t display that by default.
As-is, we go through a song and dance whenever we look at logs and metrics: “oh, this happened at X time, which is Y time for most people.”
Yeah, we are actively working on correlation. We have metrics, traces and logs in a single app - so we should be able to provide a seamless correlation.
Thanks for the point about Notebooks; we have not thought in detail about how people use that. Is it primarily to collaborate between team members when an incident happens, or even when there is no incident and you are analysing stuff?
- Incidents, collecting different metrics and showing them next to each other, with comments
- Longer-term reliability debugging. They can form a kind of ad-hoc dashboard. These are usually issues that degrade performance, don't have immediate or widespread customer impact, and are things we are not immediately able to detect
- Related, performance tuning. Sometimes, the key metric is unknown. We want to explore it, and then make changes to infra, and then see if that moved the needle
- Sometimes, the ad-hoc widgets are useful enough to export to a dashboard
- I can take any widget anywhere else and import it into a notebook, or start a new notebook out of it.
The notebooks are similar to the dashboards, except that the layout engine only allows a linear notebook layout instead of a grid. There are already text widgets, though in notebooks the button to add one is easier to access. Other than the comments, it's basically a dashboard with the UI changed so that it feels like a notebook.
Keep in mind too, all dashboard and notebooks modify timestamps and other states in the browser URL, so it is easy for me to copy-paste those into Slack so that other people can see what I am seeing.
Maybe we are too small, but Datadog is one of the few vendors we haven't been able to negotiate down in years. The price has always been what's on the website. I honestly don't even mind; with some vendors it feels like you are at a bazaar and they always tell you that their final discounts had to get approval from the CEO.
> Maybe we are too small but Datadog is one of the few vendors which we haven't been able to negotiate down in years.
We spend a few thousand a month with Datadog and our account manager reaches out every quarter to adjust our monthly commit up/down which provides a 20% discount (I think) or so off from the website prices.
I always tell people looking at Datadog to remember "DDDD" or "DataDog Don't Discount".
Compared to most Enterprise vendors it is a lot harder to get a discount from Datadog. Most vendors will give you 1/3 off just for signing a contract and committing to a spend, Datadog is not like that.
The New Relic who either had a breach they never notified me about, or sold my email address? I've been getting random spam to newrelic@<domain> for years now. Nope, they'll never see any money if I can avoid it.
I don't know if we got a discount for an annual contract, or if the annual contract pricing was published on the website. But we were able to work out things like, using seasonal pricing because our traffic is seasonal.
Seems they are not (Meta is mentioned on their profile), but it doesn't make sense either way. Datadog should not be pressuring either their own employees or others over adding Datadog-relevant collectors to OpenTelemetry, especially when their public position is that they "love" and support efforts like OpenTelemetry.
I'm always willing to dogpile on DataDog (no pun intended), given bad experiences with their sales, but all I'm reading out of this is that the DD person didn't want to review it out of a potential conflict of interest. It was reviewed however by someone else. Am I missing something here?
You're missing that this was a year-old PR ready to merge but it was stopped from merging by DataDog because it makes it easier for users to migrate from DataDog to another logging service
Where did DD stop the merge? I just saw them simply state that the reviewer didn't want to review it as they work at Datadog. It could simply be that they are biased TO merge it even if it contains flaws. Having an independent reviewer ensures the code quality meets the bar.
No, I am unsurprised by the DataDog taking that position; it fits with what I know of the company. However as someone pointed out, the comment in question wasn't originally linked, but I see it now.
None of the APM providers like New Relic, Data Dog, or Azure App Insights want a truly open ecosystem.
The only reason they support "Open" Telemetry is because they're worried about lock-in at the data sources.
For example, App Insights supports rich/structured telemetry via its proprietary SDK and various APIs. No open-source developer in their right mind would ever hard code such a proprietary dependency into something published under a truly open license.
Now that rich telemetry instead of simple text logging is starting to become an increasingly popular approach, the proprietary APM vendors got nervous that they would get "locked out" of the entire open source ecosystem, to be replaced by a data source that is open and not compatible with their proprietary sinks.
Hence Open Telemetry.
It was always about making the source open, not the sink.
I won't speculate on the top-level strategies of companies competing with one another, so take this with a grain of salt.
From my perspective (maintainer, employed by a vendor), all of us who work for these different companies collaborate very well together. We all recognize that it's both technically tractable and fundamentally user-friendly to make instrumentation be a common standard that anyone can use to point at any of the OSS and commercial tools in this space. There's plenty to differentiate on with telemetry backends, querying experiences, API capabilities, data analysis tools and UX, etc. We have a long way to go to see this vision fully realized, but it's quite far along and I have no doubt we'll arrive at the right outcome here.
Everything you've just said is compatible with what I've said.
The vendors are concerned about being left out at the "source", and are seeking to differentiate with their proprietary "sink", usually closed-source SaaS solutions.
I'm not even arguing that this is bad, it's just how markets work, and it's currently beneficial to developers in general, including both open-source developers and the type working in a cubicle farm somewhere.
100% disagree. I've talked with many of those providers and their CSuites are actually behind the open ecosystem. You forget many of them have huge marketshare to gain by having an open ecosystem
Open source these days is very often about making the thing you use to consume SaaS open so it can spread as widely as possible, but it’s never made easy to commoditize the SaaS part.
> This PR would sort of allow data to flow OUT of datadog's libs/agents?
Yes.
This allows you to expose an OpenTelemetry Collector on the Datadog agent port 8126[0], letting you collect Application Performance Monitoring (APM) traces from any APM-enabled Datadog library[1].
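A sketch of what that collector configuration might look like, assuming the receiver component is named `datadog` as in the PR; the exporter endpoint is a hypothetical non-Datadog destination and the field names are illustrative rather than authoritative:

```yaml
receivers:
  datadog:
    endpoint: 0.0.0.0:8126   # same port the Datadog agent normally listens on

exporters:
  otlp:
    endpoint: some-other-backend.example.com:4317   # hypothetical destination

service:
  pipelines:
    traces:
      receivers: [datadog]   # ingest spans from Datadog APM libraries
      exporters: [otlp]      # forward them anywhere that speaks OTLP
```

The point being: the application keeps its existing Datadog instrumentation and only the agent endpoint it reports to changes.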
If I had to guess, DataDog's argument is that they don't want you using the engineering hours they invest into their libraries to have DD do the heavy lifting of collecting APM traces and send the messages off to another service.
When we switched off of AppSignal, since our instrumentation was deeply intertwined with our code, it took a while to change that over to OTel. But at least now, being OTel, we can theoretically change to a different vendor if we want. If we were to move off of DD, and there is an OTel collector that can accept DD APM spans, then we can switch now and refactor later.
In practice, DD has a lot going for it that I don't see in New Relic. There are also some key features in DD that are not in OTel -- for example, we can't use DD's APM ingestion controls for controlling sampling rates for OTel spans, and DD has no incentive to add such a feature. I'm actually working on adding OTel sampling into our project right now. (In our case, we have to use OTel because DD does not have SDKs for Elixir.)
Firstly, not all spans are interesting. When 99.99% of your traffic is just going to serve up an HTTP 200 within your acceptable latency threshold, you don't need every one of those. You probably do want to keep 100% of error spans, or those where the root has a duration beyond a configured threshold. There's tools to be able to sample that way.
Secondly, there's ways to also attach your effective sample rate as metadata to spans, and if there's a backend that supports re-weighting counts based on that, you can still get accurate all-up counts of overall traffic.
Admittedly, OTel and many other backends don't have the best story for this yet. But it's getting better.
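The policy mix described above (keep all errors, keep slow traces, sample the rest) maps fairly directly onto the collector-contrib `tail_sampling` processor. A hedged sketch, with the thresholds and percentages as made-up placeholders you'd tune for your own traffic:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans until the full trace has arrived
    policies:
      - name: keep-all-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow-traces
        type: latency
        latency: {threshold_ms: 500}            # placeholder latency threshold
      - name: sample-the-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 1}  # placeholder: keep 1% of the rest
```

A trace is kept if any policy matches, so errors and slow requests survive even while the healthy HTTP 200 firehose is heavily downsampled.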
While I would like to ingest every one, cost is a factor.
Even if we were self-hosting, there's a cost to ingesting and storing every single span.
And even if we were able to pay for ingesting 100%, not everything is practical to ingest at 100%. Our most common request type (heartbeat) generates span payloads that are a multiple of the original request's size. We're using Elixir in production, and those servers can absorb a tremendous amount of traffic, saturating the entire CPU capacity of the hardware if we let them. The agents are not capable of keeping up.
The library you linked is just for sending metric data (gauges, counters, etc.); the relevant library for APM is called ddtrace, and it's dual-licensed under Apache 2.0 and BSD 3-Clause.
Do you actually use it at scale in a large complicated system? IMHO it’s the opposite - datadog is consistently the most innovative and fastest iterating observability platform out there. Nothing comes close.
It’s expensive yes, and there’s some lock-in yes, but they are GOOD.
As someone who has been in this space a long time, I think you are pretty off base here. The commercial vendors have absolutely been the ones who did all the innovation in tracing. The OSS equivalents were really poor imitations for the longest time. Zipkin/Jaeger/OpenTracing have been coming along slowly for many years, but it's really been maybe 2 years since that work became competitive with what was available in the commercial APM space.

For the most part it's felt like that only happened because the commercial tools all halted new work on their proprietary tools and told their staff to push OTel over the finish line. None of them actually want to pay teams of engineers to maintain proprietary libraries for every language; the execs are absolutely drooling to find a way to pawn as much of that cost off onto others via OTel. Locking people in on agents is naive when the truly heavy lift in migrating platforms is swapping dashboards and alerts and retraining all your staff.
You see the same thing around integrations: everyone used to have to roll their own proprietary chunks of code that, in the end, were all querying mostly the same data points back from servers and APIs. Now everyone just prefers to wait for the Prometheus exporter and adopts that instead.
I'm not so sure to be honest. It is very expensive. But being able to quickly get all of the features is fantastic for a small company. As you start to use more volume (and your bill starts growing) then it makes sense to start considering self-hosting. But for many teams with relatively small data volumes self-hosting is likely not worth the engineering time.
Likewise for New Relic. Unfortunately OTel and everything similar doesn't remotely come close to the usability and ease of configuration that New Relic does. Not yet, anyway...
Ideally you just implement OTel into your apps and later plug that into any metric/log/trace collector you want, so you have DataDog and Newrelic and other frontends competing based on their feature suite and price.
As it stands, this was/is trying to use DataDog's existing APM libraries to turn them into OTel for ingest into other providers.
I want to go for a drink with Moonheart (who used this unmerged, untested, year old PR in Production for a year) and find out how I, too, can become an absolute maverick.
Datadog's APM traces are the one thing that locks companies into Datadog. If this is merged, it makes it a lot easier to move out of Datadog into something else.
The only possible lock-in I see is if you have annotated code with Datadog specific tracing instructions and I'm pretty sure that most of these features aren't relevant for Open Telemetry.
Switching APM providers isn't hard at all. Maybe the Open Telemetry ecosystem lacks a good agent but my understanding is that their APM agent is pretty much already a copy of the Datadog agent.
I think New Relic would be a more interesting agent to use because their approach to APM isn't primarily to do sampled traces.
A little lost here as well, but my read was that the PR author was working with outside parties (their company??) to get approval (sign a CLA?), and some third party asked for Datadog's opinion; Datadog advised that they wait for a Datadog feature. Due to this discussion, the third party defaulted to "no" and the PR author did not push the matter, expecting Datadog to release something. When they did not after several months, the author reopened the PR and pushed to get approval.
I could be totally wrong, but maybe combined we can make sense of it. From my read it feels like DD misguided this third party which then advised the author to spike.
We switched to the OpenTelemetry agent from the Datadog agent. Out of all the commercial providers, Datadog is the worst at showing traces properly. Although it captures mostly the same traces, all of them have io.opentelemetry.something in the UI, making them very hard to read. I've tried Honeycomb, New Relic, Lightstep.