Why do we need modules at all? (2011) (erlang.org)
155 points by thomas11 on Nov 7, 2014 | 76 comments


Armstrong's proposal reminds me a bit of Emacs extensions. Since Emacs Lisp doesn't have namespaces or modules, all functions must be uniquely named, which is done by prefixing them: foo-replace. This is not that different from having a module foo, as Armstrong notes: "managing a namespace with names like foo.bar.baz.z is just as complex as managing a namespace with names like foo_bar_baz_z".

But what it enabled is an Emacs community where single functions are freely shared, for example on http://www.emacswiki.org/emacs/. People just copy them into their Emacs init file. Sometimes they modify them a little and post them again with their own prefix. This has obvious downsides such as lack of versioning and organization. But it provides a low barrier to entry and creates a dynamic community.


To me this is the same question as whether we need directories in a file system. Ideally, your file system is a flat database and files are indexed by a vast array of automatic and manually added metadata that allows you to easily retrieve them. Microsoft tried to go in this direction with WinFS, which was eventually cut from Vista, maybe because it wasn't practical (yet). Looking at how people use the Internet, though, where 90% of browsing starts at Google, this does seem a very reasonable approach for many things in the future. In the end, why should humans do manual indexing and retrieval if the computer can facilitate this part?


I think a lot of people are focusing on the implementation details, which is fun and great, but the real deep insight here is the idea of a global registry of correct functions.

If you postulate for a minute that the (truly nontrivial) surface problems are all solved, and concentrate only on the idea of a universally accessible group of functions that accretes value over time -- like a stdlib that every language on every runtime could access -- that seems like a pretty exciting idea worth thinking about.

I had something like that idea almost two decades ago (http://www.gossamer-threads.com/lists/perl/porters/26139?do=...) but at the time it was all in fun. But these days, that sort of thing starts looking pretty possible, especially for the group of pure functions.


Because humans suck at serialized content.

7 +- 2. [1] That's the number of things our prefrontal cortex/short term memory can track at once. That's why we (humans) organize things into hierarchies. That's why the best team size is around that number. Etcetera.

Heck, everything in the world on a computer is serialized into memory or onto disk. Or addressed as some disk in a serial array of disks. Serialized as in, "there's some data somewhere in these 2TB that tell me where in the same 2TB the rest of the data is." Computers excel at this. Humans are terrible at this.

I guess my point is, humans are the reason we need modules.

[1] http://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus_...


That is exactly what I thought, too!

It is all about handling complexity! By putting together things that belong together, complexity is reduced. Engines and other machines are designed that way, too: things that belong together are put in the same spot.


I quite like the idea. I think it would probably still make sense to have "collections" where a bunch of related functions can be grouped together, discovered, and worked on as a unit (this would just be an optional extra layer on top of the global function database). There would be no exclusivity in collections, though, so a function might appear in more than one collection, or in none.

Another idea: Unit tests could be stored as function metadata.


JavaScript works similarly to this, and apps/libraries that wrap themselves in a giant closure work almost exactly like this. The disadvantage of this over using modules shows up in dependencies between functions. When you don't have modules and you try to refactor, you get this annoying tendency for function a in file b to break when you change function y in file z. When you have modules, you can easily tell before changing function a whether it is exported, and if it is, see whether file z imports it.

Not saying this Erlang idea isn't good or won't work; these are just the pitfalls besides the obvious namespacing and conflicts.


JavaScript works similarly to this, and apps/libraries that wrap themselves in a giant closure work almost exactly like this.

Nope. Joe's thought experiment is: what if every function became available in the global namespace? What if every function were kept in a global datastore, so you could launch your REPL, run any function, and have it pulled down and working immediately? No imports unless you need to pin a specific function to a specific past revision.

Of course, someone would come along and say "these 30 functions only work together when pinned to these specific revisions," so you end up pulling down a named bundle of specific revisions, ...


This problem can happen in any codebase, though I can see the argument that it might be more prevalent without modules. I can also see a lot of solutions: modifying functions to always be backward-compatible, or defining the behavior of a function with a thorough unit test.


I saw Joe's strange loop talk [1] a while ago and I get the same vibe reading his post as I did when watching the video. It sounds very cool, but I can't shake the feeling that it only works for 85% of the code. That is to say if you program in exactly the right way, you will be able to do everything you want and it will work with this system, but there are ways of programming that won't work with this system.

More specifically I feel like there are two problems. 1) It feels suspiciously like there's a combination of halting problem and diagonalisation that shows there are an uncountably infinite number of functions that we want to write that can't be named (although I would want to have a better idea of how this is supposed to work before I try to hammer out a proof). 2) I don't understand how it's possible for any hashing scheme to encode necessary properties of a function such that the function with necessary properties has a different hash than an otherwise identical function without these properties. For example can we hash these functions such that stable sort looks different than unstable sort? Wouldn't we need dependent typing to encode all required properties? And if that's the case couldn't I pull a Gödel and show that there's always one more property not encodable in your system?

[1] - https://www.youtube.com/watch?v=lKXe3HUG2l4 [2]

[2] - https://news.ycombinator.com/item?id=8572920 (thanks for the link)


There are only countably many functions. A simple proof is that each function can be represented as a string, and there are countably many strings over a finite alphabet. You could also argue that functions are equivalent to Turing machines, and there are countably many Turing machines.


A function's true name should be its content hash. (Where that content hash is calculated after canonicalizing all the call-sites in the function into content hash refs themselves.) This way:

- functions are versioned by name

- a function will "pull in" its dependencies, transitively, at compile time; a function will never change behaviour just because a dependency has a new version available

- the global database can store all functions ever created, without worrying about anyone stepping on anyone else's toes

- magical zero-install (runtime reference of a function hash that doesn't exist -> the process blocks while it gets downloaded from the database.) This is safe: presuming a currently-accepted cryptographic hash, if you ask for a function with hash X, you'll be running known code.

- you can still build "curation" schemes on top of this, with author versioning, using basically Freenet's Signed Subspace Key approach (sort of equivalent to a checkout of a git repo). The module author publishes a signed function which returns a function when passed an identifier (this is your "module"). Later, they publish a new function that maps identifiers to other functions. The whole stdlib could live in the DB and be dereferenced into cache on first run from a burned-in module-function ref.

- function unloading can be done automatically when nothing has called into (or is running in the context of) a function for a while. Basically, garbage collection.

- you can still do late binding if you want. In Erlang, "remote" (fully-qualified) calls don't usually mean to switch semantics on version change; they just get conflated with fully-qualified self-calls, which are explicitly for that. In a flat function namespace, you'd probably have to make late-binding explicit for the compiler, since it would never be assumed otherwise. E.g. you'd call apply() with a function identifier, which would kick in the function metadata resolution mechanism (now normally just part of the linker) at runtime.

Plug: I am already working on a BEAM-compatible VM with exactly these semantics. (Also: 1. a container-like concept of security domains, allowing for multiple "virtual nodes" to share the same VM schedulers while keeping isolated heaps, atom tables, etc. [E.g. you set up a container for a given user's web requests to run under; if they crash the VM, no problem, it was just their virtual VM.] 2. Some logic with code signing such that calling a function written by X, where you haven't explicitly trusted X, sets up a domain for X and runs it in there. 3. Some PNaCl-like tricks where object files are simply code-signed binary ASTs, and final compilation happens at load-time. But the cached compiled artifact can sit in the global database and can be checked by the compiler, and reused, as an optimization of actually doing compilation. Etc.) If you want to know more, please send me an email (levi@leviaul.com).


I wrote a prototype programming language + distributed function storage system like this a few years ago.

I stopped when the madness became unbearable.

Exhibit A: https://github.com/mattsta/zlang/blob/master/priv/site_macro...

Exhibit B: https://github.com/mattsta/zlang/blob/master/examples/editor...

(Unbearable from an "I created a new unmaintainable language" POV; nothing bad about the distributed function storage feature.)


Oh my god! This is exactly what I've been working on in my spare time. I was just trying to come up with the specific content ID algorithm for it.

I made a graph-based UI for editing this kind of program since you wouldn't want to deal with hash-based names directly. You can see an example program here: http://nickretallack.com/visual_language/#/f2983238d90bd3e0a...

Currently it runs in JavaScript, but I was looking at other languages that it would be good to compile to, and Erlang seems like a pretty good fit for it.

Here's some other thoughts about it: https://docs.google.com/a/thinair.com/document/d/1WtgfUqN6Sd...

Can we work together?


This sounds great. The issue is that when you fix a function, you generally want everyone to use your bug fix. This is even more important for security issues.

However, sometimes code needs the broken version of some function...

There really is no great universal solution to this stuff.


The second one would be the behaviour you'd get out of the "low-level" contract of the system. The former could be built on top: you could have late-bound references that effectively call a function with a semver constraint. e.g.

    apply(repo_handle_foo, bar_fn, {'~>', 2, 3})
Meaning the system would bind that to a version of bar_fn that exists in the foo repo, and is considered to be in the 2.3.X release series. (Conveniently, given the global code DB, it'd probably always resolve to the very latest thing in that series.)
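That resolution step can be illustrated with a toy helper (hypothetical; not the actual apply() mechanism, and following this comment's reading of '~>' as "within the 2.3.X series"): filter the known versions down to the series and take the newest patch.

```javascript
// Toy resolver for a {'~>', major, minor} pessimistic version constraint.
// Versions are [major, minor, patch] triples the database knows about.
function resolvePessimistic(versions, major, minor) {
  // "~> 2.3" here means: in the 2.3.X series -- pick the newest patch.
  const matches = versions
    .filter(([maj, min]) => maj === major && min === minor)
    .sort((a, b) => b[2] - a[2]); // newest patch first
  return matches[0] || null;
}

const known = [[2, 2, 9], [2, 3, 0], [2, 3, 7], [2, 4, 1]];
resolvePessimistic(known, 2, 3); // the latest 2.3.x in the database
```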

I would suggest a slight change in thinking, though: at the individual function level, you can version implementation, but you don't version semantics. If you have foo 1.0 and foo 2.0 that do different things, that have different API contracts—then those are simply different functions which should have separate identifiers. The only time a function-contract identifier should have "revisions" is to correct flaws that diverge the implementation from the contract. Eventually, the function perfectly does what the contract specifies; at that point, the function is done. If you want the function to do something else, new contract = new identifier.

(But what if your old implementation works and meets the contract, but is way too slow? This is where you get into alternative implementations of the same contract, and compilers doing global runtime A/B-testing of different implementations as a weird sort of JIT. This is orthogonal to versioning, almost, but it means you can't put the version constraints like "relies on [revision > bugfix] of foo" in your code—because what if the VM is running a foo that never had the bug? So those constraints go in the database itself, and effectively "hide" old versions from being offered as resolution targets, without impeding direct reference.)


The hash must account for bound variables, so that fun add(x, y) { return x + y } is the same as fun add(a, b) { return a + b }. Alpha-conversion independence.


Right. Identifier names effectively become metadata; the AST would look like numbered SSA references (as in LLVM, which does a similar transformation, to allow the optimizer pass to just pattern-match known AST "signatures" and rewrite them.)


What is the probability of a hash collision between two functional pieces of code? For example, one could possibly replace a hard-coded domain "http://example.com" with "http://exàmlpœ.com" (assuming it results in the same hash), then register exàmlpœ.com and do a man-in-the-middle attack with it.


I am not a cryptographer so maybe one of them can chime in and verify this:

If you use a strong crypto hashing algorithm that would be impossible given current computational resources and their growth for eons into the future. For example, there are no known collisions of SHA-2. No one has found 2 items that have the same SHA-2 hash.

Quantum computers / some breakthrough algorithm could change that. If that happens, all encryption on the internet likely breaks as well, except where quantum cryptography is being used.


The probability, assuming a properly-designed hash function, is no more likely than about one in 300 undecillion. No algorithmic flaws are known that would let an attacker take an arbitrary input to MD5, SHA-1, SHA-2, or SHA-3 and construct a hash collision in noticeably better time. There are attacks that let an attacker construct two inputs with the same hash for MD5 and SHA-1, so out of an abundance of caution we're considering MD5 and SHA-1 both "broken", there are now two successor hashes to SHA-1, and we're hurrying to replace all uses of MD5 and SHA-1 where collisions matter.

If you're not designing your software to worry about cosmic rays, you shouldn't design your software to worry about hash collisions. Just pick a good hash.
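The scale of that figure can be sanity-checked with the standard birthday bound: for a b-bit hash and n stored items, the collision probability is roughly n²/2^(b+1). A quick sketch, done in log space to avoid overflow:

```javascript
// Back-of-envelope birthday bound for an ideal b-bit hash.
// Returns x such that the collision probability is about 2^-x.
function collisionOddsExponent(nItems, hashBits) {
  const log2n = Math.log2(nItems);
  // probability ~ n^2 / 2^(b+1)  =>  exponent = b + 1 - 2*log2(n)
  return hashBits + 1 - 2 * log2n;
}

// Even a trillion stored functions under SHA-256 leave the accidental
// collision probability around 2^-177 -- far below cosmic-ray territory.
collisionOddsExponent(1e12, 256);
```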


I disagree with your threat model. Why do you think that the ability to create two colliding functions with basically arbitrary content does not endanger the end user? E.g. (1) I write (or copy) a useful function f1, (2) create an evil function f2, (3) manipulate the metadata of both functions to create f1' and f2', which are functionally identical to f1 and f2, (4) publish f1'. (5) People start to use f1', (6) an end user requests hash(f1') from the key-value store, (7) I man-in-the-middle the connection and return f2' instead, (8) the end user executes f2' and is compromised.


First off, because you simply don't have that ability in SHA-2 or SHA-3. If you're designing a new system, don't use MD5 or SHA-1; if you're using a legacy system, organize some sort of orderly panic (just like the CA/Browser forum is doing with SHA-1 certificates).

Second, because it's not "basically arbitrary": the content almost always looks like it's been specifically designed to make room for a hash collision. For binary formats like images and PDFs, it's easy to put a large amount of unrendered data in the file format that isn't visible in a viewer, which is exactly why the newsworthy collisions have been images and PDFs. Even X.509 certificates allow you some room for arbitrary data. For program code, having a bunch of arbitrary data in the middle of the function would look extremely suspicious. (Obviously you shouldn't design the metadata format of your functions to permit enough arbitrary data, but given the talk of alpha conversions, it sounds like the proposal is to hash a canonical, high-information-density representation of the function.)

But again this doesn't come up unless you're using a broken hash. My argument is just that even the "broken" hashes really aren't very broken, so if you're using the non-broken ones, you should basically assume they're perfectly secure.


> First off, because you simply don't have that ability in SHA-2 or SHA-3. If you're designing a new system, don't use MD5 or SHA-1;

Then don't mention MD5 and SHA1 in the first place. The sooner they leave everyone's mind as valid alternatives the better.

> For binary formats like images and PDFs, it's easy to put a large amount of unrendered data in the file format that isn't visible in a viewer

MD5 collisions have gotten much shorter in the past years [1].

> For program code, having a bunch of arbitrary data in the middle of the function would look extremely suspicious. (Obviously you shouldn't design the metadata format of your functions to permit enough arbitrary data,

And you trust the designers of these formats to know this "obvious" fact? You reference code normalization, but there was talk in this thread about keys that are to be associated with the functions to allow updates (and thus included in the hash) and I think it is perfectly valid to include graphics in the documentation of a function.

> My argument is just that even the "broken" hashes really aren't very broken, so if you're using the non-broken ones, you should basically assume they're perfectly secure.

My point is that this "structured collision resistance" is used far too often as a handwave argument for why a specific protocol can continue to use a broken hash. (Remember how CAs said the same things about X.509 certificates before Appelbaum, Molnar et al. [2] presented an actual proof-of-concept?) Software developers already have difficulty distinguishing pre-image resistance from collision resistance. Giving them yet another argument to shoot themselves in the foot with is not a good idea.

[1] http://www.win.tue.nl/hashclash/SingleBlock/ and http://marc-stevens.nl/research/md5-1block-collision/

[2] http://www.win.tue.nl/hashclash/rogue-ca/


Yeah, you're right I was unclear about the status of MD5 and SHA-1. Unfortunately they are currently both well-known and faster, so I think there's value in calling them broken -- but I didn't do that effectively.

My argument is roughly that, one, you should listen to cryptographers about what is "broken" and take that advice seriously, and two, having done that, you should realize that the standard of "broken" is so conservative that the possibility of collisions in a non-broken hash is not even worth thinking about at an application level. I'm not sure that got across effectively.

Thanks for the links!


a function will never change behaviour just because a dependency has a new version available

Presumably this only works for pure-functional languages?


I'm assuming his suggestion is no updates ever unless you update your dependency's version. Many module systems, npm for example, have a culture of using fuzzy matching, which means when you run npm install again you can easily pull in new versions of libraries that have been upgraded since you last ran it. I'm a fan of strict dependencies, but some people prefer to make it easier to stay on the latest minor patch or other logical upgrade patterns.


Another option would be to have an area reserved for metadata about upgrades that can't be modified by page content but is browser-level, still hash-verifiable, with fixed URLs for pages. At the top there would be something like: "There have been 5 changes to this page since this version. See Changes [link] | Go to latest version [link]".


Not even for them. A data type could change: the interfaces between functions are what must be versioned, e.g. as in an SML module system.


It would be nice if the hash were locality-sensitive (so functions with similar structure have similar hashes).

IDE could do "It looks like you're trying to write qsort - we already have it written 10000 times in these functions: ...".
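Not a true locality-sensitive hash, but the IDE-side idea can be approximated cheaply with Jaccard similarity over token shingles (a rough sketch; real near-duplicate detection would canonicalize identifier names first, as discussed elsewhere in the thread):

```javascript
// Crude structural-similarity check over source text: Jaccard similarity
// of k-token shingles. Higher score = more shared structure.
function shingles(src, k = 3) {
  const toks = src.split(/\W+/).filter(Boolean);
  const out = new Set();
  for (let i = 0; i + k <= toks.length; i++) out.add(toks.slice(i, i + k).join(" "));
  return out;
}

function similarity(a, b) {
  const A = shingles(a), B = shingles(b);
  let inter = 0;
  for (const s of A) if (B.has(s)) inter++;
  return inter / (A.size + B.size - inter); // Jaccard index in [0, 1]
}

const a = "var clone = {}; each(obj, function(key, value) { clone[key] = value; }); return clone;";
const b = "var copy = {}; each(obj, function(key, value) { copy[key] = value; }); return copy;";
similarity(a, b); // > 0: the shared structure survives the renaming
```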


"the global database can store all functions ever created, without worrying about anyone stepping on anyone else's toes"

I've had similar thoughts in the past, though my thought was as a function loader for javascript. Why not just experiment with this as library for javascript functions? Downloading the code from the "global database" would be provided by a function:

getDataByHash(hash) : Blob

To load the returned code, use eval (blech), or back up a step and use a script tag plus an advanced version of magnet links I'd call snow links:

<script src="snow:?xt=urn:sha2:beaaca..."></script>

There will have to be a dom-scanning/watching js library that the pages load from a normal url which scans the page for snow links, downloads them by calling getDataByHash(), and then swaps in new URLs using URL.createObjectURL(Blob).

This leads me in a somewhat orthogonal direction to your post...

Obviously any element in the page that loads resources would benefit from this (link rel=stylesheet, img, iframe, etc), so the document-scanner should work its magic on them as well.

To me the tricky part becomes "the global database." There are several ways to implement it. My thought would be to build it as a DHT on top of Web-RTC. I'd look into webtorrent, it has to scale from very small to very large files. Maybe have multiple different DHTs that the scanner will try.

Storing stuff in the DHT ought to be as simple as declaring to the network that data (or solution) of a given hash is known. Clients p2p connect over to the knower (or daisy chain style as a sort of ad-hoc STUN/TURN setup?), and virally spread the data by declaring they too now have the solution for the hash they just downloaded. A CDN can still store a file and can be listed as a fallback provider in the snow url.

As an example application:

Once this DHT exists and p2p is built on top of it, it should not be hard to ask peers to run the content of a given SHA and return the result (or they will return its SHA if it's large, effectively memoizing the function). Something like:

run(hash) : RunOffer

Any peer (AWS or Google etc. could implement a peer as well) could respond to that with an offer to run the computation for an amount of p2p currency that scales to billions or trillions of tiny transactions per second.

...

Anyhow, this comment is way too long already, but the ideas keep flowing from here. There are a lot of technical challenges for vital features (like how to register (im)mutable aliases for hashes and distribute them to the network without using DNS for authoritative top-level namespacing? how to get reliable redundancy guarantees for stored data without resorting to a CDN? what to do about realtime streaming of data produced by an ongoing process? grouping multiple devices for the same user?)

All-in-all it seems like being able to get data by the hash of its content from inside of a program (including library code) as easily as loading it from a URL or off the filesystem is pretty useful. It also seems like we can engineer the technology to make this happen, so I think it's inevitable and will happen pretty soon.


I use requirejs for module loading and for my personal projects try to limit each module to a single function. Contrived example:

    define('object/clone', ['object/each'], function(each) {
        /**
         * Shallow copy objects
         *
         * @param {object} obj - object from which to copy properties
         * @return shallow copy of `obj`
         */
        return function(obj) {
            var clone = {};
            each(obj, function(key, value) {
                clone[key] = value;
            });
            return clone;
        };
    });


> Clients p2p connect over to the knower (or daisy chain style as a sort of ad-hoc STUN/TURN setup?), and virally spread the data by declaring they too now have the solution for the hash they just downloaded.

This is, in fact, exactly what Freenet does! It's a global DHT acting as a content-addressable store, where there's no "direct me to X" protocol function, only a "proxy my request for X to whoever you suspect has X, and cache X for yourself when returning it to me" function.

(Aside: Freenet also does Tor-like onion-routing stuff between the DHT nodes, which makes it extremely slow to the point that it was discarded by most as a solution for anonymous mesh-networking. But it doesn't have to do that. "Caching forwarding-proxy DHT" and "encrypted onion-routing message transport" are completely orthogonal ideas, which could be stacked, or used separately, on a case by case basis. "Caching forward-proxying DHT" is to the IP layer as "encrypted onion-routing message transport" is to TLS. PersistentTor-over-PlaintextFreenet would be much better than either Tor or Freenet as they are today.)

I agree that this could totally be done as a library in pretty much any language that allows for runtime module loading or runtime code evaluation. And, like you say, there are tons of interesting corollaries.

I think, though, that to be truly useful, you have to push this sort of idea as low in the stack as possible.

Unix established the "file descriptor" metaphor: a seekable stream of blocks-of-octets, some of which (files) existed persistently on disk, some of which (pipes) existed ephemerally between processes and only required resources proportional to the unconsumed buffer, some of which (sockets) existed between processes on entirely separate hosts. Everything in Unix is a file. (Or could be, at least. Unix programmers get "expose this as a file descriptor" right at about the same rate web API designers get REST right.) Unix (and most descendants) are "built on" files.

To truly expose the power of a global code-sharing system, you'd need the OS to be "built on" global-DHT-dereferenceable URNs. There would be cryptographic hashes everywhere in the system-call API. Because your data likely wouldn't just be on your computer (just cached there), you'd see cryptographic signatures instead of ownership, and encryption where a regular OS would use ACL-based policies.

At about the same level of importance that Unix treats "directories", such an OS would have to treat the concept of data equivalence-mapping manifests (telling you how to get git objects from packfiles, a movie from a list of torrent chunk hashes, etc.)

For any sort of collaborative manipulation of shared state, you'd probably see blocks from little private cryptocurrency block-chains floating about in the DHT, where the "currency" just represents CPU cycles the nodes are willing to donate to one-another's work.

And then (like in Plan9's Fossil), you might see regular filesystems and such built on top of this. But they'd only be metaphors—your files would be "on" your disk to about the same degree that a process's stack is "in" L1 cache. Disks would really just be another layer of the memory hierarchy, entirely page-cache; and marking a memory page "required to be persisted" would shove it out to the DHT, not to disk, since your disk isn't highly available.

But designing this as an OS for PCs would be silly. It would have much too far to go to catch up with things like Windows and OSX. Much better to design it to run in userspace, or as a Xen image, or on a rump kernel on server hardware somewhere. It'd be an OS for cloud computers, or to be emulated in a background service on a PC and spoken with over a socket, silently installed as a dependency by some shiny new GUI app.

Which is, of course, the niches Erlang already fits into. So see above :)


Related reading:

ipfs, an optionally-authenticated hash-based global filesystem: http://ipfs.io

subresource integrity: an emerging web standard whereby you can specify a resource by its hash: http://w3c.github.io/webappsec/specs/subresourceintegrity/


Thank you! I had searched but hadn't been able to find this idea. This sounds exactly like what I have had bouncing around in my head for a while now!

I just read the w3c spec. It seems more like an integrity check, but it's so trivial to just use the integrity hash to download the data that the next step is removing the src tags altogether / using them as a fallback.


Yes. The spec includes the option of providing fallback URLs to fetch from, and gives the browser fairly broad freedom (as I read it) to fulfill the request so long as the response data matches the given hash. How this connects up with the other ideas being discussed in this thread isn't immediately clear to me, as DHTs in practice tend to be too slow to block on for most in-browser resources, but I'd definitely call it an intriguing development.


Sounds like we are looking at the same set of problem, just from slightly different starting point!

The reasons to build the DHT on top of WebRTC rather than just using Freenet:

* Freenet brings a lot of extra 'features' that you mention that aren't needed for a lot of use cases; I view Freenet more as one potential implementation of the function getDataByHash(hash)

* using a URI-style notation (snow:?xt=urn:sha2:beaaca...) makes it clear that the data has 1 simple universal address; it's not terribly difficult to write a FUSE layer that mounts /mnt/snow/sha to your favorite sha-providing client (possibly just calling through to node.js)

* using WebRTC gives you a large platform to get the network kickstarted, as the userbase is extremely large and no install is required

So I do agree 100% that we should try to build a language (or languages!) that refer to functions by the hash of their source.

I view the programming language as just a sub problem of 2 issues:

1. We have to make DHT that was trivially accessible on all platforms, including, and especially, in the browser (and likely there first).

2. Even with the DHT we need a way to alias/label the hashes. The alias can also be versioned. Immutable aliases would just be fixed to version 0.

So for a language implemented on top of the DHT, the aliases would correspond to the function names.

This would be a function

resolveAlias(alias) : Hash

where alias is something like "alias:dns.biz.jackman.code.spacegame.flySpaceship?v=0"

* v=0 fixes the function to a set version, so even if it is later patched the code doesn't change; if no v is specified then the latest is used (a more complicated resolution scheme might be desired)

* A dns. prefix is used because hopefully someday we can move beyond needing to piggyback on DNS for setting authority on keys, in which case a different super-TLD can be chosen

The corresponding putAlias(name, hash, signature, versionNumber=0) function would attempt to associate a name with a hash on the network.

To prevent anyone from naming things under domains they don't control, a simple solution could be to use public/private keys. Vend the public key(s) of those able to set aliases under that (sub)domain as a DNS TXT record, or even a CNAME. They can then authorize the putAlias by signing salt+name+hash+version with their private key (keeping in mind the need to avoid length-extension attacks).

I have thought along the lines of CPU cycles as a currency, but I haven't been able to think of a way to make that a tradeable and bankable currency. I think you still need a notion of currency that just tries to be a currency, albeit one that scales down to nano-transactions; BTC might work if it can handle a higher volume of TPS. This is because you might need to add an extra parameter, bounty, to every API function, wherein you name a price you are willing to pay for your peers to perform that action. Under normal browsing routines, your client should be doing a lot more work than it is asking to have done. Those normal clients should be accumulating currency.

When you are on mobile you might be a drain on the network; in that case you can simply subsidize that activity from your home laptop, or build credit back up at night when your phone is plugged in charging and on a Wi-Fi network (or maybe a wifi mesh plus UAVs that the network itself purchases to increase its coverage).

One other thing: if you want to run a big-data study on a bunch of SHAs and use a lot of compute, then obviously you are going to have to acquire some of this currency, since you are a load on the network.

Here's how the hello world for that would look:

    var hash = putDataByHash("Hello World")
    executeRemotely(function () { return getDataByHash(hash) })


What happens if somebody discovers a critical weakness in your hash function?


Is this a bit like blockspring?


Yup - building a universal library of functions. Let me know if you have any questions paul@blockspring.com.


Lambda the Ultimate's discussion http://lambda-the-ultimate.org/node/5079 is pretty interesting.


To answer the question in the title directly, I think modules are to aid reading and discovery.

The fact that it is difficult to decide which module a function belongs in doesn't make them pointless. People who have to read or debug your code use them to quickly zero in on areas of likely interest.


In my experience, telling programmers "all functions must have unique names" means you get a half-ass module system tacked on via common prefixes. In other words, you get "foo_bar_function1", "foo_bar_function2" etc.


While you're talking about Erlang specifically, the concepts you bring up can be applied to programming in general.

Why does Erlang (or any other language) have modules?

The biggest reason for me (and I think the one with the most merit) is for clarity and usability.

Modules exist as ways of grouping units of code by the responsibilities of that code. If you removed this hierarchy, wouldn't things become a lot more difficult to navigate and understand as a developer?


Is the author's use of the term `module` specific to erlang? To me, it sounds like he's advocating for modules that are comprised of a single function, rather than utility belt modules that contain many functions. As I understand it, I agree with what the author proposes, and I feel like a subset of npm already provides what he's talking about. The best example is probably underscore.js versus lodash.js, which both have many functions and a wide API surface area. What's notable is that you can cherry-pick individual lodash functions and depend on a specific version[0]. (Admittedly, I lazily pull in the full lodash module instead of importing only the function(s) I'm using)

Lately, I've been moving more toward the proposed design in my Node.js projects. It keeps individual files concise, makes code sharing trivial, encourages stateless methods, and it makes writing tests a breeze.

[0] https://www.npmjs.org/browse/keyword/lodash-modularized
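To make the cherry-picking contrast concrete, here's a self-contained sketch. lodashLike stands in for a big "utility belt" module (the names and implementations are invented, not real lodash code); the modularized packages linked above effectively ship each such function as its own versioned dependency.

```javascript
// A stand-in for a wide-API utility module (hypothetical).
const lodashLike = {
  chunk: (arr, n) => {
    const out = [];
    for (let i = 0; i < arr.length; i += n) out.push(arr.slice(i, i + n));
    return out;
  },
  uniq: (arr) => [...new Set(arr)],
  // ...dozens more functions, all versioned together
};

// Utility-belt style: depend on the whole module even for one function.
const _ = lodashLike;
console.log(_.chunk([1, 2, 3, 4], 2));

// Cherry-picked style, as with the lodash-modularized npm packages:
// the dependency is exactly one function, versioned on its own.
const chunk = lodashLike.chunk;
console.log(chunk([1, 2, 3, 4], 2)); // [ [ 1, 2 ], [ 3, 4 ] ]
```

The second style is closer to what the article proposes: the unit of sharing and versioning is the function, not the belt.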


This is basically what Urbit is doing, among other things.


Is Urbit a real thing, or an elaborate hoax?


It made me curious too. Found this HN post about urbit: https://news.ycombinator.com/item?id=6438320




Probably both.


The problem is now you either have zero data abstraction or uncontrolled data abstraction without even a convention like "these functions work together as a bundle" to save you.

That said, a nice SML module probably could work as the base abstraction here.


The problem with this approach is you need to consider every existing function name in order to define a new one.

The beauty of commonjs modules is they allow you to focus on implementation, rather than identification. All functions can be anonymous, identified only by their path and named at the whims of the caller.


Related to this would be all the cool content-addressable third-party metadata. Services could automatically generate precompiled versions of things or alternate optimizations. Or autocomplete data, or statistics, test suites, behavioral diffing, example code, documentation: the options are endless.


So, immutability and/or api contract is important here.

If I'm pulling in a function, I want it to do what I think I want. Sometimes I want that to change (get a bug fix), but sometimes I don't (someone introduces a bug, or makes the func more general and introduces slowdowns).

This feels like a job for a content-addressable git-like tool. How about this:

I can discover my function (via whatever means). The function is actually named 8804ea505fda087da53b799434c377f015933707 (the sha-something of its (normalised?) textual representation).

I then import it into my codebase as "useful_fun". My code reads like:

    useful_fun("do it", "to it")
but I have some kind of dependencies/import record which says that "useful_fun" is actually 8804ea505fda087da53b799434c377f015933707. That means one and only one thing across all time, the func with that hash.

So how do we handle updates? If we want a golang-like model, the developer could run something like "update deps". This would:

- go back to the central repository, looking for updates to 8804ea505fda087da53b799434c377f015933707. It might find 5. Local policy then determines what happens. Could be "always choose the original authors update" or "choose the one with the most votes" or "always ask the dev, showing diffs".

Note that because the unique name is based on the function content, any change to it creates a new item in the db. (Content-addressability, same way git and other systems do it.)

- stuff can be grouped and batched. If I pull in 10 functions tagged with the same project ('module') and they've all been updated, I can say "and do the same with all the others".

- This kind of metadata allows all kinds of good stuff. I can subscribe to alerts on the functions I've imported and get told about new versions, or security warnings. This kind of subscription information could be used as a popularity contest to solve the "which fork on GitHub do I want to use" problem.

- people can still publish modules. They now look like a git directory or tree. A git tree is a blob which contains the hashes of the files within it. A 'module' could be a blob which specifies which (immutable) functions are in it.

If we use normalised functions, we've now got a module representation which allows arbitrary functions to be pulled together. At fetch time, we can denormalise into the user's preferred coding style. At push time, we renormalise. We aren't grouping stuff into files, so a 'project' or a 'module' consists solely of the semantic contents, nothing to do with artificial grouping for the file system.

Seems like an interesting future.


I think Mr. Armstrong would approve, given his comments near the end of https://www.youtube.com/watch?v=lKXe3HUG2l4, where he opines that the web would be great if, instead of URLs, every published document were just named with a hash of its content.


> every published document were just named with a hash of its content.

I see too many issues with this (for example):

- I publish a news article. I publish a retraction/update to said article. Now the article has a new hash. Does the old hash give you the old version of the article, or redirect you to the new version?

- How do we define 'document?' If we define it as the complete HTML page served up to the browser, then changes to the design of the site would invalidate all previous hashes. Pointing old hashes to new hashes is work, which will not always be done (leading to the same situation we have with site redesigns breaking old URLs).


http://ipfs.io lets you reference content in one of two ways: either an immutable hash of the content, or a reference to the public key that's allowed to publish/update an immutable hash of the content. Seems like a pretty good compromise.


Exactly, you should still be able to have references to persistent identities. Much like the semantics of clojure which has a distinction between values and references to identities like vars/agents etc.

These URLs would be clearly marked of course.


Why not just have all URLs be mutable aliases for hashes?


Why not keep the old document and let the new version (child) refer to the old one (parent)? You then "just" need a refresh feature that can retrieve newer versions of the document for you. In our P2Pedia system (I referred to it in a sibling post earlier) you can go from the parent to its children via search.


We have a P2P file-sharing program that does this, called U-P2P (http://u-p2p.sf.net). Content is hashed, and you use a Gnutella search using the hash to retrieve it. Documents are organized by what we call "communities", which themselves are represented by a document and its corresponding hash. So the document name is really made up of two hashes: the one of the community it belongs to, and its own hash. You can use these hashes as hyperlinks, and U-P2P resolves it via search, as previously mentioned.

What we think is great about it is that the hash is location-independent. There could be multiple copies of the document at various locations at any given point. As long as there is at least one copy and that it is reachable via search, it will be retrieved.

We also built a distributed Wiki based on that idea and platform, called P2Pedia (http://p2pedia.sf.net).

It's all very much an academic research project, so don't expect a beautiful interface or easy-to-install packaging or anything, but I think it's a good proof of concept.

(note to self: we should really move these to GitHub).


In React.js, you can serialize your whole app state through a simple `JSON.stringify` and base64-encode that into the URL. The nice property of that is that you get to pass that URL around to friends, and when they click on it they'll go to the page, which decodes and deserializes the URL and reproduces the exact app state, down to the letters in the input boxes.

Effectively, this gives you "program as a value" where the same url means the same program. Immutable programs basically.

I've tried this and the current downside is that it looks extremely ugly when you try to share a link lol. But this should be circumventable. The other downside is that this is still a bit theoretical: you'll have to exclude sensitive information such as passwords, and sometimes stuff lives in a closure rather than in your `state`.
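The round trip itself is tiny. A sketch, assuming a serialisable state object and Node's Buffer for base64 (in the browser you'd use btoa/atob instead); the state shape and URL are made up:

```javascript
// Encode the entire app state into a URL fragment.
function stateToFragment(state) {
  return Buffer.from(JSON.stringify(state)).toString('base64');
}

// A friend's browser decodes the fragment back into the exact state.
function fragmentToState(fragment) {
  return JSON.parse(Buffer.from(fragment, 'base64').toString('utf8'));
}

const state = { route: '/editor', draft: 'hello', cursor: 5 };
const url = 'https://example.com/app#' + stateToFragment(state);

const restored = fragmentToState(url.split('#')[1]);
console.log(restored.draft); // "hello"
```

Same URL, same state, same rendered page: the URL really is the program's value.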


And the other other downside would be when your app state becomes big enough to not fit in a URL


Emerging standard in that area, subresource integrity: http://w3c.github.io/webappsec/specs/subresourceintegrity/

It's initially just doing the simplest possible thing (making the resource unavailable unless its hash is valid) but semantically it will probably be allowed for the browser to resolve the resource using other methods (e.g. if it already has that resource cached from another URL) so long as the hash matches.


So we could simply set up a URL-shortening service that published such hashes. Unfortunately, with the 'dynamic' nature of web pages these days, that's going to be hard to go back to. It may be an interesting way to reboot the web, though. 'Regular' URLs then become merely a DNS-like layer on top of a content-hashing scheme.


this talk about modules as a way to organize similar code makes me wonder: if you had all the functions in a global namespace, you could probably automatically generate some kind of organization by extracting relevant features from each function and doing some kind of clustering. some features could be the function's dependencies, who depends on it, what it returns, its signature, and maybe even NLP in the hope that people are actually using descriptive variable names.


isn't this issue sort of analogous to the expansion/contraction of a language core?

Except in this case the core is user-generated and ever-expanding.

I bet there are a lot of issues in Java history that could predict possible bumps in the road for such a system (since it was essentially concurrently designed by a bunch of actors -- except in that case they were corporate entities)


reminds me of gmail: instead of hierarchical directories ("modules"), just search, and have multiple tags, so an email can be in more than one directory ("metadata").

Seems especially applicable to fp (like erlang), where code reuse is more often of small functions.


I think what you're discussing is really just namespacing ala C++, Java, or .NET. Especially with Java and .NET, you don't import a self-contained module directly from individual source files. The modules are technically all accessible at any time (or at least, the ones linked in to the build, which in the case of the Java and .NET standard libraries is quite a smorgasbord). You just reference the class/function you want in some way: either with using statements or with fully qualified names.

Because, really, if you start throwing everything into one store, you're going to run into the naming conflict issue, and any attempt at addressing the naming conflict issue is going to either look like importing modules or look like namespaces. You either have to explicitly state what your program has access to, or you explicitly state what function you mean when you have access to everything. Realistically, if you give every function a unique name and don't use namespaces, then there will start to be functions called system_event_fire() and game_gun_fire() and disasters_house_fire() and you're right back to having namespaces, just not in name or with a syntax that makes things nice when you know you're dealing with specific things.

Though, it'd be nice if types weren't the only thing that could be placed into a namespace directly in .NET. I'd like to put free functions in there. The Math class in the System namespace only exists because of this. I'd have preferred there to be a System.Math namespace with Cosine and Sine as members of it. Then I could "using System.Math;" and call "Cos(angle)". Instead, I'm stuck in a limbo of half-qualified names.

And I like it. I like it a lot more than Python, Racket, Node.js, etc. and having to import this Thing X from that Module Y. I like the idea that linking modules together is defined at the build level, not at the individual source file level. These languages are supposed to be better for exploratory programming than Java and C#, but actually, you know, doing the exploring part is harder!

Sometimes, I really do just want to blap out the fully qualified name of a function, in place in my code. System.Web.HttpContext.Current.User. If I'm doing something like that, it's a hack, and I know it's a hack, and having the fully qualified name in there, uglying it up, makes clearer that it's a hack. Though, I suppose I'm one of the rare people who actually do go back and clean up my hacks.

EDIT: I thought I wrote more, weird.

The network-accessible database of every library, ever, is definitely a great idea. I think it's where we're heading, with tools like NPM, NuGet, etc. It seems like a natural progression to move the package manager into the compiler (or linker, rather, but that's in the compiler in most languages now). Add in support in an editor to have code completion lists include a search of the package repository and you're there.


I don't know Erlang, so I might be missing something key here.

"I am thinking more and more that if would be nice to have all functions in a key_value database with unique names."

Yeah, sure... Sounds good, right. Until you have naming conflicts.

So then the patch is "oh, let's just add another column to make it more unique", without realizing that you've just, in essence, created a "module" of sorts except it's stored in some sort of giant key/value database.

And then you've come full-circle back to the dilemma the author complains of which is that he doesn't know where to put a function that seems to belong in two modules.

Eventually, I'd say this is a general failing of modules that could potentially be solved by some sort of inheritance. Maybe even a tagging mechanism if you really want to be "patch-work joe" about it.


Okay, let's try to not shit all over new ideas here. If we take what Joe-from-2011 means instead of hallucinating him to be incompetent...

Let's re-word "all functions in a database" as "a revision control system in a database."

So, let's make a revision control system. All contents, branches, tags are kept in a database.

Sounds good, right. Until you have naming conflicts.

No, no, no. There are no naming conflicts. The names humans will use are just pointers to the most recently updated underlying contents. The _actual_ names are garbage hash identifiers. The _usable_ names are human names bound to underlying contents.

So, if master is commit A and you make commit B, there is no naming conflict on the name "master," you just re-point it to commit B.

> a function that seems to belong in two modules.

That's the problem with explicit hierarchy and why the world now runs on tagging-based crowdsourced folksonomies.


"A revision control system in a database"? A revision control system is a database. I think what you and the author are trying to get at is some sort of "docker, but for functions" type of thing. And we all know what a mess that is when it comes to public docker images.

"No, no, no. There are no naming conflicts. The names humans will use are just pointers to the most recently updated underlying contents. The _actual_ names are garbage hash identifiers. The _usable_ names are human names bound to underlying contents. So, if master is commit A and you make commit B, there is no naming conflict on the name "master," you just re-point it to commit B."

"Master" is the actual name that is going to be conflicting, if I understand your example.


> "Master" is the actual name that is going to be conflicting...

Yes, but... Just like in git, "master" points to one and only one commit, but the pointed-to commit might be different in the future. The name of each commit is a guaranteed-unique-for-the-foreseeable-future hash of the contents of each commit. That name never, ever changes.

If you want a particular commit, you use its unique, non-human-friendly name. If you just want a particular branch and don't care too much about any particular commit, you use its collision-prone human-friendly name. Naming of branches and/or projects is still going to be "hard". Naming of particular releases/versions of code is not.


>he doesn't know where to put a function that seems to belong in two modules.

In a third module, where else?


dibs on create_uuid_v4!!



