The problem is this: given a list of tasks that you want to run in parallel, you cannot know ahead of time which tasks touch the same data. You have to start running them to find out. The hard part isn't distributing the work; it's knowing when it's safe to do two things in parallel.
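To make this concrete, here is a minimal sketch in Python. Everything in it is hypothetical and chosen only for illustration: the `TrackingStore` wrapper, the `follow_pointer` task, and the sample data. Each task reads one key and then reads whichever key that value names, so the set of keys it touches cannot be determined without running it. The sketch runs each task against a recording wrapper and only afterwards reports which pairs overlap.

```python
from typing import Callable, Dict, List, Set, Tuple


class TrackingStore:
    """Hypothetical wrapper around a dict that records every key a task reads."""

    def __init__(self, data: Dict[str, str]) -> None:
        self._data = data
        self.touched: Set[str] = set()

    def read(self, key: str) -> str:
        self.touched.add(key)
        return self._data[key]


Task = Callable[[TrackingStore], None]


def follow_pointer(start: str) -> Task:
    """Hypothetical task: read `start`, then read the key its value names."""

    def task(store: TrackingStore) -> None:
        target = store.read(start)  # the second key is unknown until this read happens
        store.read(target)

    return task


def discover_footprint(task: Task, data: Dict[str, str]) -> Set[str]:
    """Run the task against a tracking store and return the keys it touched."""
    store = TrackingStore(data)
    task(store)
    return store.touched


# Illustrative data: "a" and "b" both point at "x", while "c" points at "y".
data = {"a": "x", "b": "x", "c": "y", "x": "", "y": ""}
tasks: List[Task] = [follow_pointer("a"), follow_pointer("b"), follow_pointer("c")]

footprints = [discover_footprint(t, data) for t in tasks]

# Two tasks conflict if the sets of keys they touched overlap.
conflicts: List[Tuple[int, int]] = [
    (i, j)
    for i in range(len(tasks))
    for j in range(i + 1, len(tasks))
    if footprints[i] & footprints[j]
]

print(conflicts)
```

Looking only at the task definitions, tasks 0 and 1 start from different keys and appear independent. The overlap on `"x"` only shows up once both have actually been run.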