
I think that when someone uses ls instead of a glob it means they most probably don't understand shell. I don't see any advantage of parsing ls output when a glob is available. Shell is finicky enough not to invite more trouble. Same with word splitting: it's one of the reasons to use shell functions, because then you have "$@", which makes sense; any other way to do it is something I can't comprehend.
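
A minimal sketch of both habits (glob iteration and "$@" forwarding); the file names here are made up:

  # iterate with a glob instead of parsing ls output
  for f in ./*.txt; do
      printf 'found: %s\n' "$f"
  done

  # a function forwarding its arguments intact via "$@"
  thumbnail_all() {
      for arg in "$@"; do
          printf 'would process: %s\n' "$arg"
      done
  }
  thumbnail_all "holiday photo.jpg" "another one.jpg"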

Maybe I also don't understand shell, but as was said before: when in doubt, switch to a better-defined language. Thank heavens for awk.



  > I think that when someone uses ls instead of a glob it means they most probably don't understand shell.
In 25 years of using Bash, I've picked up the knowledge that I shouldn't parse the output of ls. I suppose that it has something to do with spaces, newlines, and non-printing characters in file names. I really don't know.

But I do know that when I'm scripting, I'm generally wrapping what I do by hand in a file. I'm codifying my decisions with ifs and such, but I'm using the same tools that I use by hand. And ls is the only tool that I use to list files by hand, so I find it natural that people would (naively) pick ls as the tool to do that in scripts.


What I don't understand is why the two most popular Unix flavors haven't got something like a JSON list output, or something else that is parsable.

Is it really that difficult to add --json as a flag?


The soul of UNIX is to create a confused and in some cases unparseable text-based format from what was already structured data.


i think it boils down to the dependencies needed to parse the json, coupled with the fact that glob syntax already covers iterating over the files regardless of characters used in the filename.

there are other tools than `ls` whose sole purpose is to list files; some have "improved" features over ls, etc.

similarly from above (and well said btw, even what is not quoted): > I do know that when I'm scripting, I'm generally wrapping what I do by hand in a file. I'm codifying my decisions with ifs and such, but I'm using the same tools that I use by hand

a lot of us do similar, and we know/expect the ins and outs; all it takes to break our scripts is some edge case we never thought of. we are fortunate that our keyboard layouts are basically ascii; other languages are less fortunate. now introduce open source, community-driven software where some bash code that parses ls output deletes somebody's home directory (an edge case in some user's files causing some obscure "fun" times). an edge case is still painful.

and finally, sometimes it's better to elevate said bash script to python (or awk), etc. it just depends on the situation and the complexity of the logic


Globs don't help much when I want file attributes and sizes. Yeah, I can pipe to something that can do it, but it would be nice to just get filenames, attrs, sizes, and dates in a JSON array as output.

Look, ls is one of the most basic and natural Unix commands. Make it modern and useful.

Bash gibberish is fun for gatekeeping scripting neckbeards, but it's not what a proper OS should have.


the thing with the coreutils is they provide basic core functionality; you don't need bash on your system - `ls` is not bash (and then you still end up with busybox, where json still would not be part of ls). add more utilities to your system to do more complex logic; i've used similar apps to this in the past: https://github.com/kellyjonbrazil/jc
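
for example (a rough sketch; i'm assuming jc's `--ls` parser and its `filename` field here, so check `jc -h` for what your version actually supports):

  ls -la /usr/bin | jc --ls | jq -r '.[].filename'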

there's also using zero-terminated lines in ls with `--zero`, then piping that to a number of apps which support similar (read, xargs, etc.)
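
something like this, as a sketch (`--zero` needs GNU coreutils 9.0+):

  # NUL-terminated entries survive spaces and newlines in names
  ls --zero | while IFS= read -r -d '' name; do
      printf 'entry: %s\n' "$name"
  done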

might also check out powershell on linux, which may suit your needs; instead of string manipulation, everything is a class object


It's not in any of the major distros, but shout out to csv-nix-tools for a valiant effort in this space: https://github.com/mslusarz/csv-nix-tools


exactly—well said


People new to *nix make the mistake of thinking this stuff is well designed, makes sense and that things work well together.

They learn. We all do.


Coincidentally, I discovered the Unix-Haters Handbook today:

https://web.mit.edu/~simsong/www/ugh.pdf


"The Macintosh on which I type this has 64MB: Unix was not designed for the Mac. What kind of challenge is there when you have that much RAM?"

Love it.


I don't understand what they mean in that quote. Neither Unix nor the Mac were designed for that much RAM.


Judging from the context, the user interface was fine in the days of limited resources (a 16 kiloword PDP-11 was cited) but then modern computers have the resources for better user interfaces.

They clearly didn't realize that even more modern Unix kernels would require hundreds of megabytes just to boot.


What kernel takes 200 MB+ to boot?


OT ... I worked with Simson briefly ages ago. Smart dude. This book happened later and I've never seen it before. Small world I guess.


People new to *nix don't realize that it's a 55 year old design that keeps accumulating cruft.


Of course, but the same (with a bit lower number of years) can be said about Windows, or HTTP, or the web with its HTML+JS+CSS unholy trinity, or email, or anything old and important really. It's scary how much of our modern infrastructure hinges on hacks made tens of years ago.


One of the original demos showing off PowerShell was well structured output from its version of ls.

That was 17 years ago!


People new to the internet think alike. Still, not a day passes without our being reminded how fragile yet amazing all this information theory stuff is.


I went through a phase when I really enjoyed writing shell scripts like

  ls *.jpg | awk '{print "resize 200x200 " $1 " thumbnails/" $1}' | bash
because I never got to the point where I could remember the strange punctuation the shell requires for loops without looking up the bash info pages, whereas I've thoroughly internalized awk syntax.

Word is you should never write something like that because you'll never get the escaping right and somebody could craft inputs that would cause arbitrary code execution. I mean, they try to scare you into using xargs, but I find xargs so foreign I have to read the whole man page every time I want to do something with it.


Better is something like

  find . -maxdepth 1 -name "*.jpg" -exec resize 200x200 "{}" "thumbnails/{}" \;
which works for spaces and probably quotes in filenames; I am not sure about other special characters.


It's tough to be portable and have a one-liner. See https://stackoverflow.com/questions/45181115/portable-way-to...

I switched the command to a GraphicsMagick-based resize since that's the tool these days. The default quality is 75% (for JPEG), but it's included as a commonly desired customization. ,, is from a different comment in this thread; it seems more self-documenting than the single , I'd traditionally use.

  find . -maxdepth 1 -name "*.jpg" -print0 |
  xargs -0 -P "$(nproc --all)" -I,, gm convert ,, -resize '200x200^>' -quality 75 "thumbnails/,,"


I encourage you to give it a try again. Almost every use of xargs that I ever did looked like this:

ls *.jpg | xargs -i,, resize 200x200 ,, thumbnails/,,

I just always define the placeholder to ,, (you can pick something else but ,, is nice and unique) and write commands like you do.


I'm more likely to write that like:

  for i in *.jpg; resize 200x200 "$i" "thumbnails/$i"; end


Does that not fail when you hit the maximum command line length? Doesn't the entirety of the directory get splatted? Isn't this the whole reason xargs exists?


No, it does not fail. Maximum command line length exists in the operating system, not the shell; you can't launch a program with too many arguments, and you can't launch a program whose combined argument strings are too long.

But when you execute a for loop in bash/sh, the 'for' command is not a program that is launched; it's a keyword that's interpreted, and the glob is also interpreted.

Thus, no, that does not fail when you hit the maximum command line length (POSIX only guarantees 4096 bytes; most *nix systems allow far more). It'll fail at other limits, but those limits exist in bash and are much larger. If you want to move to a stream-processing approach to avoid any limits, then that is possible, while probably also being a sign you should not use the shell.
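
A quick way to see the difference (the limit is the kernel's, queried with getconf; `wc` here is just a stand-in command):

  getconf ARG_MAX                  # the OS limit on the exec'd argument block
  for f in *; do wc -c "$f"; done  # one exec per file: never trips that limit
  wc -c *                          # one exec with every name: can fail with
                                   #   "Argument list too long" (E2BIG)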


That's right. I tested this just now in a directory with 1,000,000 files:

  $ for i in *; do echo $i; done | wc -l
  1000000
I'm a little bummed that it failed in fish shell, but wouldn't begrudge the author if they replied "don't do that".


The for loop only runs resize once per file. So no, the entire directory does not get splatted. It is unlikely you'd hit maximum command length.

At least on mac, the max command length is 1048576 bytes, while the maximum path length in the home directory is 1024 bytes. There might be some unix variant where the max path length is close enough to the max command length to cause an overflow, but I doubt that is the case for common ones.

xargs exists to build command lines from command output. You could, for instance, have awk output file names for xargs to build up a command invocation from arbitrary records read by awk. Note that xargs still has to obey the command line length limit, because the command line needs to get passed to the program; in a situation where a single invocation would overflow the limit, xargs has to split the work across several invocations. Thus I would always use globbing if I have the choice.

EDIT: If you mean that the directory is splatted in the for loop, then in a theoretical sense it is. However, since "for" is a shell builtin, it does not have to care about command line length limits to my knowledge.
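
You can actually watch xargs do that splitting; with enough input it runs the command several times rather than once (the exact batch size depends on the system's limit):

  seq 1 200000 | xargs echo | wc -l   # prints more than 1 on a typical system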


Yes, this is an issue, absolutely.

I've seen some image directories with more than a million files in them.


This shouldn't overrun the command line length for resize, since resize only gets fed one filename at a time. I do think that the for loop would need to hold all the filenames in a naive shell implementation. (I would assume most shells are naive in this respect) The for loop's length limit is probably the amount of ram available though. I find it improbable that one could overflow ram with purely pathnames on a PC, since a million files times 100 chars per file is still less than a gig of ram. If that was an issue though, one would indeed have to use "find" with "-exec" instead to make sure that one was never holding all file names in memory at the same time.


Exactly, there are so many limits in the shell that I don’t want to be bothered to think about. When I get serious I just write Python.


I just use find. it's a little longer but gives me the full paths and is more consistent. also works well if you need to recurse. something like `find . -type f | while read -r filepath; do whatever "${filepath}"; done`


I love this example, because it highlights how absolutely cursed shell is if you ever want to do anything correctly or robustly.

In your example, newlines and spaces in your filenames will ruin things. Better is

    find … -print0 | while read -r -d $'\0'; do …; done
This works in most cases, but it can still run into problems. Let's say you want to modify a variable inside the loop (this is a toy example, please don't nit that there are easier ways of doing this specific task).

    declare -a list=()

    find … -print0 | while read -r -d $'\0' filename; do
        list+=("${filename}")
    done
The variable `list` isn't updated at the end of the loop, because the loop is done in a subshell and the subshell doesn't propagate its environment changes back into the outer shell. So we have to avoid the subshell by reading in from process substitution instead.

    declare -a list=()

    while read -r -d $'\0' filename; do
        list+=("${filename}")
    done < <(find … -print0)
Even this isn't perfect. If the command inside the process substitution exits with an error, that error will be swallowed and your script won't exit even with `set -o errexit` or `shopt -s inherit_errexit` (both of which you should always use). The script will continue on as if the command inside the subshell succeeded, just with no output. What you have to do is read it into a variable first, and then use that variable as standard input.

    files="$(find … -print0)"
    declare -a list=()

    while read -r -d $'\0' filename; do
        list+=("${filename}")
    done <<< "${files}"
I think there's an alternative to this that lets you keep the original pipe version when `shopt -s lastpipe` is set, but I couldn't get it to work with a little experimentation.
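
For reference, a sketch of that lastpipe variant; it only takes effect when job control is off, which is the default in a non-interactive script (and may be why interactive experiments fail):

    #!/usr/bin/env bash
    shopt -s lastpipe
    declare -a list=()
    find … -print0 | while IFS='' read -r -d '' filename; do
        list+=("${filename}")
    done
    printf 'collected %d entries\n' "${#list[@]}"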

Also be aware that in all of these, standard input inside the loop is redirected. So if you want to prompt a user for input, you need to explicitly read from `/dev/tty`.

My point with all this isn't that you should use the above example every single time, but that all of the (mis)features of shell compose extremely badly. Even piping to a loop causes weird changes in the environment that you now have to work around with other approaches. I wouldn't be surprised if there's something still terribly broken about that last example.


You have really proven your point even more than you meant to. Unfortunately none of these examples are robust.

The "-r" flag allows backslash escaping record terminators. The "find" command doesn't do such escaping itself, so that flag will cause files with backslashes at the end to concatenate themselves with the next file.

Furthermore, if IFS='' is not placed before each instance of read, or set somewhere earlier in the program, then leading and trailing whitespace in a filename will be stripped.

EDIT: I proved your point even more. The "-r" flag does the opposite of what I thought it did, and disables record continuation. So the correct way to use read would be with IFS='' and the -r flag.
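
Putting that together, the corrected loop would look something like:

    while IFS='' read -r -d '' filename; do
        list+=("${filename}")
    done < <(find … -print0)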


Love it. And I wouldn’t be surprised in the least if even this fell apart in some scenarios too.


Wow, you people are really young.

http://www.etalabs.net/sh_tricks.html


Is there a reason to prefer `while read; ...;done` over find's -exec or piping into xargs?


Both `find -exec` and xargs expect an executable command whereas `while read; ...; done` executes inline shell code.

Of course you can pass `sh -c '...'` (or Bash or $SHELL) to `find -exec` or xargs but then you easily get into quoting hell for anything non-trivial, especially if you need to share state from the parent process to the (grand) child process.

You can actually get `find -exec` and xargs to execute a function defined in the parent shell script (the one that's running the `find -exec` or xargs child process) using `export -f` but to me this feels like a somewhat obscure use case versus just using an inline while loop.
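
A rough sketch of that `export -f` route (the function name and find expression are made up):

  shrink() { printf 'shrinking %s\n' "$1"; }   # stand-in for the real work
  export -f shrink
  find . -type f -name '*.jpg' -print0 |
      xargs -0 -n1 bash -c 'shrink "$1"' _     # the trailing _ fills $0, so each name lands in $1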


I will sometimes use the "| while read" syntax with find. One reason for doing so is that the "-exec" option to find uses {} to represent the found path, and it can only be used ONCE. Sometimes I need to use the found path more than once in what I'm executing, and capturing it via a read into a reusable variable is the easiest option for that. I'd say I use "-exec" and "| while read" about equally, actually. And I admittedly almost NEVER use xargs.


This will fail for files with newlines.


How common are they?


This whole post is about uncommon things that can break naive file parsing.


When you don't want to waste your time and sanity and happiness being in doubt and then throwing away all you've done and switching to a new language midstream, just don't even start using a terribly crippled shell scripting language in the first place, and that especially includes awk.

The tired old "stick to bash because it's already installed everywhere" argument is just as weak and misleading and pernicious as the "stick to Internet Explorer because it's already installed everywhere" argument.

It's not like it isn't trivial to install Python on any system you'll encounter, unless you're programming an Analytical Engine or Jacquard Loom with punched cards.


In most places where I run shell scripts, there is no Python. There could be if I really wanted it but it's generally unnecessary waste.

On top of it, shell is better than Python for many things, not to mention faster.

It's also, as you mentioned, ubiquitous.

In the end, choose the tool that makes more sense. For me, a lot of the time, that's a shell script. Other times it may be Python, or Go, or Ruby, or any of the other tools in the box.


A waste of what, disk space? I'd much rather waste a few megabytes of disk space than hours or days of my time, which is much more precious. And what are you doing on those servers, anyway? Installing huge amounts of software, I bet. So install a little more!

For decades, on most Windows computers where I've run web browsers, there's always been Internet Explorer. So do you still always use IE because installing Chrome is "wasteful"? It's a hell of a lot bigger and more wasteful than Python. As I already said, that is a weak and misleading and pernicious argument.

So what exactly is bash better than Python at, besides just starting up? That only matters if you write millions of little bash and awk and sed and find and tr and jq and curl scripts that all call each other, because none of them are powerful or integrated enough to solve the problem on their own.

Bash forces you to represent everything as strings, parsing and serializing and re-parsing them again and again. Even something as simple as manipulating JSON requires forking off a ridiculous number of processes, and parsing and serializing the JSON again and again, instead of simply keeping and manipulating it as efficient native data structures.

It makes absolutely no sense to choose a tool that you know is going to hit the wall soon, so you have to throw out everything you've done and rewrite it in another language. And you don't seem to realize that when you're duct-taping together all these other half-assed languages with their quirky non-standard incompatible byzantine flourishes of command line parameters and weak antique domain specific languages, like find, awk, sed, jq, curl, etc, you're ping-ponging between many different inadequate half-assed languages, and paying the price for starting up and shutting down each of their interpreters many times over, and serializing and deserializing and escaping and unescaping their command line parameters, stdin, and stdout, which totally blows away bash's quick start-up advantage.

You're arguing for learning and cobbling together a dozen or so different half-assed languages and flimsy tools, none of which you can also use to do general purpose programming, user interfaces, machine learning, web servers and clients, etc.

Why learn the quirks and limitations of all those shitty complex tools, and pay the cognitive price and resource overhead of stringing them all together, when you can simply learn one tool that can do all of that much more efficiently in one process, without any quirks and limitations and duct tape, and is much easier to debug and maintain?


> For decades, on most Windows computers where I've run web browsers, there's always been Internet Explorer. As I already said, that is a weak and misleading and pernicious argument.

On its own, I agree. But you glossed over everything else I said, so I'm not going to entertain your weak argument.

You seem to ignore that different users, different use cases, different environments, etc. all need to be taken into account when choosing a tool.

Like I said, for most of my use cases where I use shell scripting, it's the best tool for the job. If you don't believe me, or think you know better about my circumstances than I do, all the power to you.


> You seem to ignore that different users, different use cases, different environments, etc. all need to be taken into account when choosing a tool.

I have worked on projects that are extremely sensitive to extra dependencies and projects that aren't.

Sometimes I am in an underground bunker and each dependency goes through an 18 month Department of Defense vetting process, and "Just install python" is equivalent to "just don't do the project". Other times I have worked on projects where tech debt was an afterthought because we didn't know if the code would still be around in a week and re-writing was a real option, so bringing in a dependency for a single command was worthwhile if we could solve the problem now.

There is appetite for risk, desire for control, need for flexibility, and many other factors just as you stated that DonHopkins is ignoring or unaware of.


Plus jq and curl might not even be installed. And I never got warm with jq, so if I need to parse json from shell I reach for... python. Really.


Alternatively, maybe you can get warmer with JMESPath, which has jp as its command line interface https://github.com/jmespath/jp .

The good thing about the JMESPath syntax is that it is the standard one when processing JSON in software like Ansible, Grafana, perhaps some more.
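
e.g. a tiny sketch (jp reads JSON on stdin and takes the expression as its argument, if I recall its CLI correctly):

  echo '{"foo": {"bar": ["a", "b", "c"]}}' | jp 'foo.bar[1]'   # prints "b"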


I'm an avid jq user. There are certainly situations where it's better to use python because it's just more sane and easier to read/write, but jq does a few things extremely well, namely, compressing json, and converting json files consisting of big-ass arrays into line delimited json files.
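
Those two jobs are one-liners in jq, for what it's worth (file names made up):

  jq -c . big.json        # minify: whole document on one line
  jq -c '.[]' array.json  # explode a top-level array into line-delimited JSON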


One advantage: `ls -i` gives you the file's inode in a POSIX portable way. If you glob and then look it up individually for each file, you'll need to be aware of which tool (and whether it's GNU or BSD in origin) you use on which platform.

In general yes globbing is better for iterating through files. But parsing `ls` doesn't necessarily mean the author doesn't know shell. It might mean they know it well enough to use the tools that are made available to them.
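
Concretely, the inode is the same everywhere with `ls -i`, while the stat spelling differs (file name made up):

  ls -i some-file        # POSIX: "<inode> some-file"
  stat -c %i some-file   # GNU coreutils stat
  stat -f %i some-file   # BSD / macOS stat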


Commands can have a maximum number of arguments. Try globbing on a directory with millions of files.


Usually the pattern is "for f in [glob]", which doesn't have that issue. Running "ls" on a directory is little more than `for f in *; do echo "$f"; done`, so there's little advantage to using "ls".

Also: "find -exec {} \+" will take ARG_MAX into account, and may be much faster depending on what you're doing.


Sane people will just use find and/or xargs.


Weird thing to call sane when it's the shell that is insane, or more likely an instrument of torture.


It's not the case very often these days, but it used to be quite simple to blow up your script by globbing in a directory with a lot of files, and you can still hit the limit if you pass a glob to some command, because it can blow up trying to execve(). Here are more details on the issue and some workarounds: https://unix.stackexchange.com/questions/120642/what-defines...


Sometimes I want all filenames from a subdirectory, without the subdirectory name.

I can do (ignoring parsing issues):

    for name in $(cd subdir; ls); do echo "$name"; done
This isn't easy to do with globbing (as far as I know)


One alternative:

  for name in subdir/*; do basename "$name"; done


Also since subdir is hardcoded, you can reliably type it a second time to chop off however much of the start you want:

  for name in subdir/subsubdir/*; do
    echo "${name#subdir/}"  # subsubdir/foo
  done


Note this string replacement is not anchored (right?) which can end up biting you badly (depending on circumstances of course).


It's anchored on the left. ${name#subdir/} will turn 'subdir/abc' into 'abc', but will not touch foo/subdir/bar. I don't think bash even has syntax to replace in the middle of an expansion; I always pull out sed for that.


Thanks for clarifying, I learned something new today!

Edit: It turns out that Bash does substitutions in the middle of strings using the ${string/substring/replacement} and ${string//substring/replacement} syntax, for more details see https://tldp.org/LDP/abs/html/string-manipulation.html
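
For example:

  x=foo/subdir/bar
  echo "${x/subdir/XYZ}"   # foo/XYZ/bar    (first match only)
  echo "${x//o/0}"         # f00/subdir/bar (every match)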


This is really easy to do with a shell pattern.

  $ x=/some/really/long/path/to/my/file.txt
  $ echo "${x##*/}"
  file.txt


I'd really like if the "find" command supported this much easier, so if I write

    find some/dir/here -name '*.gz'
then I could get the filenames without the "some/dir/here" prefix.
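
GNU find can already do roughly this, though it isn't POSIX: `-printf '%P\n'` prints each path relative to the starting point.

    find some/dir/here -name '*.gz' -printf '%P\n'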

It would also be nice if "find" (and "stat") could output the full info for a file in JSON format so I could use "jq" to filter and extract the needed info safely instead of having to split whitespace-separated columns.


Why would you do this work when stat (and GNU find) can `printf` the exact needed information without any parsing?
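
e.g., both of these emit size and name with no ls parsing (GNU syntax; BSD stat spells it differently):

  find . -maxdepth 1 -type f -printf '%s\t%f\0'   # size <tab> basename, NUL-terminated
  stat --printf '%s\t%n\0' -- *                   # GNU stat equivalent over a glob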


If I need the file size and the filename, then I still need to parse a filename that might contain all kinds of weird ASCII control characters or weird Unicode.

JSON makes that a lot less fragile.


I don't get it; I need a concrete example.


It's at least pretty easy to shorten the filenames:

  cd some/dir/here
  find . -name '*.gz'
  cd - # changes back to previous directory


What about:

  find . -name '*.hs' -exec basename {} \;


You could get mixed up here because find is recursive by default and basename won't show that files might be in different subdirectories.


If you are gonna do a subshell (cd subdir; ls) you can wrap the whole loop:

  (cd subdir
  for name in *; do
    echo "$name"
  done)
But I prefer:

  for name in subdir/*; do
    name="${name#*/}"
    echo "$name"
  done





