Hacker News | flutetornado's comments

My experience with Qwen3.5 9B has not been the same. It's definitely good at agentic responses, but it hallucinates a lot. 30%-50% of the content it generated for a research task (local code repo exploration) turned out to be plain wrong, to the extent of made-up file names and function names. I ran its output through Kimi K2 and asked it to verify the output - which found that much of what it had figured out after agentic exploration was plain wrong. So use smaller models, but be very cautious about how much you depend on their output.

There are several useful ways of engineering the context used by LLMs for different use cases.

MCP allows anybody to extend their own LLM application's context and capabilities using pre-built *third party* tools.

Agent Skills let you have the LLM enrich and narrow down its own context based on the nature of the task it's doing.

I have been using a home-grown version of Agent Skills for months now with Claude in VS Code, using skill files and extra tools in folders for the LLM to use. Once you have enough experience writing code with LLMs, you realize this is a natural direction for engineering their context. It is very helpful for pruning unnecessary parts from "general instruction files" when working on specific tasks - all orchestrated by the LLM itself. And external tools for specific tasks (such as finding out which cell in a Jupyter notebook contains the code the LLM is trying to edit) make LLMs a lot more efficient and accurate: efficient because they are not burning through precious tokens to do the same work, and accurate because the tools are not stochastic.
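For anyone curious what a skill file looks like in practice, here is a minimal sketch following the published SKILL.md shape (markdown with YAML frontmatter holding a name and a description the model uses to decide when to load the skill). The skill name, wording, and helper script here are hypothetical, not from any real setup:

```markdown
---
name: notebook-helper
description: Locate and edit cells in Jupyter notebooks. Use this skill
  whenever the task involves .ipynb files.
---

# Notebook helper

Before editing a notebook, run `scripts/find_cell.py <notebook> <symbol>`
to find which cell defines a symbol, instead of reading the whole file.
Only load the matching cell into context.
```

The frontmatter description is what lets the model narrow its own context: the full skill body is only pulled in when the task matches.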

With Claude Skills now I don't need to maintain my home grown contraption. This is a welcome addition!


I prefer it because it forces distillation to core ideas, consumable quickly. Busy people have too little time to read too much verbiage.


And there is a mutually understood degree of nuance. There is no space to consider every route of uncertainty or qualify every statement. You can say "the Earth is round" instead of "most of us agree that the Earth very very likely exists and is very likely to be round".


I'd do that every time I get the chance! Ex-HPE black label on my resume, from a startup I used to work at that they bought. That company is a complete horror show.


It was weird. It makes me sad because the startup I worked at was really gelling despite the HPE interference. Then they just laid everyone off one day (multiple senior leadership changes later) for no apparent reason.

All the code is Apache 2 so I guess if I really cared I could just revive it... and as it turns out, I don't care that much. Other stuff to do.


Everyone on my team left - the best of engineering as well as every manager. Underpaying and oversubscribing people has become a hallmark over there - it's just a body shop now. Engineers are just numbers on a sheet, to be exploited, chewed up and cast aside when they eventually burn out.

Upper management has no vision, and everyone is constantly firefighting and struggling to catch up with competitors who had the long-term vision to invest in engineering teams, tooling and infrastructure to scale up their products and people. They want to do in 2 years what took Google and Amazon a couple of decades. The result post-HPE: a poor-quality, unscalable, cobbled-together, barely functional codebase. Before, the startup I worked for had a rare, well-balanced combination: a high-performance, modular and well-architected codebase. The constant push to ship as fast as possible to catch up with the competition completely destroyed the whole thing - teams, codebase and infrastructure.

All because they only know how to react and have no idea how to stay ahead of the curve. Buying startups has become their only means of survival, as talent stays away from their brand, and the only way to justify value to shareholders is to jump from one rock to another, hoping the new one will rocket them away from the black hole they are spiraling into. All they manage to do is cling to the new rock and drag it with them, as fast as they were going, into the hole they will eventually vaporize in.


"Engineers are just numbers on a sheet, to be exploited, chewed and cast aside when they eventually burnout."

This is exactly how Epic the Electronic Medical Record company operates, but on new college grads instead of Engineers.


Most of the industry and most series C/D startups are like that. It’s a sad state boys and girls. Once you’ve been here long enough, disillusionment sets in. Corporate greed, (em)powered by shareholder greed, takes top priority.


What do you do then?

I’ve been out of the work rat race for over 3 years now, but I’ll have to go back within a year… and I’m dreading it.

It's my most valuable skill set, but I just want to throw up when I see what the industry has become, and I don't know how to deal with it.


I'm always a little embarrassed when I go to the doctor and tell them I'm a software engineer. I know your EMR system is terrible, but if I had built it, it would have been better. Sorry :(

(My primary care doctor's office was venture-funded at one point and they actually have a great system. But all my specialists are on MyChart and everything there is always a disaster. Doesn't even have a "preferred name" field, so it has to be noted on my records on a case by case basis and it's ... inconsistent.)


Wow


What was it? Since the source code is open source, you probably wouldn't mind telling?


I think the performance in terminal apps could come from the fact that terminals can use pre-rendered text glyphs, cached on the GPU, drawn into a fixed-size grid of glyph-sized cells instead of arbitrary pixels. At least that's what I have wondered, because the terminal UI experience has always felt a lot more slick compared to heavy GUI-based programs. Feel free to correct me if someone has done actual performance analysis.


I was able to compile ollama for the AMD Radeon 780M GPU, and I use it regularly on my AMD mini-PC, which cost me $500. It does require a bit more work. I get pretty decent performance with LLMs - just a qualitative statement, as I didn't do any formal testing, but the performance vibes were comparable to an NVIDIA 4050 GPU laptop I also use.

https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M...
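For reference, the commonly used workaround (besides the prebuilt ROCm libraries in the linked repo) is to override the GPU target ROCm reports, since the 780M's gfx1103 is not an officially supported target. This is a sketch of the typical environment setup; treat the version string and device index as assumptions to adjust for your ROCm release:

```shell
# Force ROCm to treat the 780M (gfx1103) as a supported RDNA3 target.
# 11.0.2 is the override value commonly reported to work for this iGPU.
export HSA_OVERRIDE_GFX_VERSION=11.0.2
# Make sure ollama picks the iGPU (device 0 here, an assumption).
export ROCR_VISIBLE_DEVICES=0
ollama serve
```

Also worth checking that the BIOS-allocated VRAM (or GTT size) is large enough for the models you want to run.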


Same here on a Lenovo ThinkPad 14s with an AMD Ryzen™ AI 7 PRO 360 that has a Radeon 880M iGPU. Works OK on Ubuntu.

Not saying it works everywhere, but it wasn't even that hard to set up - comparable to CUDA.

Hate the name though.


Nobody will come after you for omitting the tm


You never know


My understanding is that top_k and top_p are two different methods of restricting token choice during inference. top_k=30 considers only the top 30 tokens when selecting the next token, while top_p=0.95 considers the tokens covering the top 95% of probability mass. I assumed you'd only need to select one.

https://github.com/ollama/ollama/blob/main/docs/modelfile.md...

Edit: Looks like both work together. "Works together with top-k. A higher value (e.g., 0.95) will lead to more diverse text, while a lower value (e.g., 0.5) will generate more focused and conservative text. (Default: 0.9)"

Not quite sure how this is implemented - maybe one cut is applied after the other when there are enough interesting tokens!


They both work on a list of tokens sorted by probability. top_k selects a fixed number of tokens; top_p selects the top tokens until the sum of their probabilities passes the threshold p. So, for example, if the top 2 tokens have probabilities of .5 and .4, then a 0.9 top_p would stop selecting there.

Both can be chained together, and some inference engines let you change the order of the token filtering, so you can do p before k, etc. (among all the other sampling parameters, like repetition penalty, removing the top token, DRY, etc.). Each filtering step readjusts the probabilities so they always sum to 1.


GPU workloads are either compute bound (floating point operations) or memory bound (bytes being transferred across memory hierarchy.)

Quantizing generally helps with the memory bottleneck but does not reduce computational cost, so it's not as useful for improving the performance of diffusion models - that's what it's saying.
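A back-of-envelope sketch of the distinction, with illustrative numbers (not taken from the article):

```python
# Weight-only quantization shrinks bytes moved, not FLOPs performed.
params = 7e9                      # a 7B-parameter model (illustrative)
flops_per_token = 2 * params      # ~2 FLOPs per weight per token, unchanged
bytes_fp16 = params * 2           # 16-bit weights: 14 GB read per token
bytes_int4 = params * 0.5         # 4-bit weights: 3.5 GB read per token
traffic_ratio = bytes_fp16 / bytes_int4   # memory traffic drops 4x
```

So a memory-bound workload (LLM decoding) can speed up roughly with the traffic ratio, while a compute-bound one (diffusion) is still paying the same `flops_per_token`.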


Exactly. The smaller bit widths from quantization might marginally decrease the compute required for each operation, but they do not reduce the overall volume of operations. So the effect of quantization is generally more impactful on memory use than on compute.


Except in this case they quantized both the parameters and the activations, leading to decreased compute time too.


Altair is superb. I have used it a lot and it has become my default visualization library. It works in VS Code and JupyterLab. The author has a great workshop video on YouTube for people interested in Altair. I especially like the ability to connect plots to each other, so that selecting a range in one plot changes the visualization in a connected plot.

One possible downside is that it embeds the entire chart data as JSON in the notebook itself, unless you use server-side data tooling, which is possible with additional data servers - although I have not used that, so I cannot say how effective it is.

For simple plots it's pretty easy to get started, and you can build pretty sophisticated inter-plot visualizations as you get better with it and understand its nuances.


Awesome to hear that you like Vega-Altair. With the recent VegaFusion integration you don't need to embed the data in the notebook anymore, and I've found Altair to scale quite well. Give it a shot.


Try the neogit plugin for Neovim. It's a work in progress.

https://github.com/NeogitOrg/neogit


Good to hear, thank you!

