Sunday, April 12, 2015

And another rant: how to isolate spark/storm dependencies from my own dependencies

In my last rant I talked about classloaders and about keeping the dependencies used by my code from overlapping with the dependencies used by the container that runs my code. The only contact between them should be through interfaces/annotations visible to my code.

In the following rant I write about functions and predicates. And this relates again to classloading issues.

I have a huge amount of geographical data that I want to reduce to the data for a single country. And I want to do it in spark or storm.

  • I do not know in advance which country the data will be filtered by. It's a parameter.
  • I cannot ask a remote service for each data point which country it is in, as that is too slow.
Normally I would create a package that contains the shapes for all countries. On top of that there is a service that takes a country as a parameter and returns a stateless, immutable, thread-safe predicate that can be applied to any coordinate from my data (see the sketch after this list):
  • Each shape maps to one country.
  • I need a library that uncompresses the shape map.
  • I need a library that loads the shapes, creates an index, and offers a function that returns the matching shape for a coordinate, and implicitly with it the country.
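
To make that concrete, here is a minimal sketch of what such a service could look like. All names in it are made up; I also make the predicate serializable right away, because spark/storm will ship it inside a closure.

```java
import java.io.Serializable;

// Sketch only, all names are made up.
public interface CountryPredicateService {

    // A stateless, immutable, thread-safe predicate over coordinates.
    // Serializable, so spark/storm can ship it inside a closure.
    interface CoordinatePredicate extends Serializable {
        boolean test(double lat, double lon);
    }

    // Takes the country as a parameter and returns a predicate that
    // answers: does this coordinate lie inside that country's shapes?
    CoordinatePredicate forCountry(String country);
}
```
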
The shape map is rather big. Heavily compressed it is still almost 10MB. It has to be distributed to every node that runs the job, and there it has to be unpacked, prepared, and so on.

To reduce the size, I can offer a remote service that prepares the shape map so that only the shapes needed for the particular country are returned. This service is called before my job runs, and the result can then be safely distributed with the job.

I can also unpack everything in the driver, which then distributes the result, so the nodes do not have that work.
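
In spark that could look roughly like the following sketch, assuming the prepared per-country shape is serializable (the data-model classes here are made up):

```java
import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class DriverSidePreparation {

    // Minimal stand-ins for the real data model (made up for this sketch).
    public static class DataPoint implements Serializable {
        public double lat, lon;
    }

    // A prepared, serializable shape for one country (made up as well).
    public interface CountryShape extends Serializable {
        boolean contains(double lat, double lon);
    }

    public static JavaRDD<DataPoint> filterByCountry(JavaSparkContext sc,
                                                     JavaRDD<DataPoint> points,
                                                     CountryShape shape) {
        // The broadcast variable is serialized once on the driver and
        // shipped to every node; the nodes do not unpack anything themselves.
        final Broadcast<CountryShape> broadcast = sc.broadcast(shape);
        return points.filter(p -> broadcast.value().contains(p.lat, p.lon));
    }
}
```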

For uncompressing I can use zip, which ships with the Java SDK.
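
A sketch, using nothing but java.util.zip:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public final class ShapeMapUnpacker {

    // Unpacks the zipped shape map into a directory, using only classes
    // from the JDK, so uncompression adds no extra dependency.
    public static void unpack(InputStream zipped, File targetDir) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(zipped)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                File out = new File(targetDir, entry.getName());
                if (entry.isDirectory()) {
                    out.mkdirs();
                    continue;
                }
                out.getParentFile().mkdirs();
                try (OutputStream os = new FileOutputStream(out)) {
                    int read;
                    while ((read = zip.read(buffer)) != -1) {
                        os.write(buffer, 0, read);
                    }
                }
            }
        }
    }
}
```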

There are geo libraries out there that offer spatial indexes and contains/intersects functions. Such a library has to come with my package. But the one I know does not support shipping its indexes via serialization, so I assume I have to prepare everything on the actual node.
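
For illustration, here is roughly how such an index looks with JTS, one of those libraries (whether it fits my case is a different question): the bounding boxes go into an STRtree, and the exact contains test runs only on the candidates.

```java
import java.util.List;
import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.geom.Point;
import com.vividsolutions.jts.index.strtree.STRtree;

// Built on each node, since the index itself cannot be shipped
// via serialization.
public final class ShapeIndex {

    private final STRtree index = new STRtree();
    private final GeometryFactory factory = new GeometryFactory();

    public void add(String country, Geometry shape) {
        // Index by bounding box; the exact test happens in countryOf().
        index.insert(shape.getEnvelopeInternal(), new Object[] { country, shape });
    }

    @SuppressWarnings("unchecked")
    public String countryOf(double lon, double lat) {
        Point point = factory.createPoint(new Coordinate(lon, lat));
        List<Object[]> candidates = index.query(point.getEnvelopeInternal());
        for (Object[] candidate : candidates) {
            if (((Geometry) candidate[1]).contains(point)) {
                return (String) candidate[0];
            }
        }
        return null; // no shape matched, e.g. a point in the ocean
    }
}
```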

This looks like a straightforward solution.

But again I have dependencies that I cannot control, and they bring in their own dependencies in turn. For example, one depends on Guava. Storm and spark also use Guava, but in a different version. So there is a real chance that code is called that needs a specific feature from Guava, and if an older version that does not have this feature yet wins on the classpath, that code will break. The same is true the other way around, when newer versions of Guava remove deprecated functionality. I have seen this with Guava's Stopwatch.
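
The Stopwatch case in a nutshell. A sketch, with the version details from memory:

```java
import java.util.concurrent.TimeUnit;
import com.google.common.base.Stopwatch;

public class StopwatchClash {
    public static void main(String[] args) {
        // Compiles fine against a recent Guava, where this factory
        // method exists. If an older Guava wins on the classpath at
        // runtime, this line throws NoSuchMethodError.
        Stopwatch watch = Stopwatch.createStarted();
        System.out.println(watch.elapsed(TimeUnit.MILLISECONDS));
        // The other direction: code compiled against an old Guava
        // using `new Stopwatch()` breaks once a newer Guava has
        // removed the deprecated public constructor.
    }
}
```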

I would really love to have my dependencies separated. I do not want to align my code with dependencies under the control of other people. It restricts me.

What can I do?

Usually the container puts its own dependencies and my dependencies onto one classpath and then calls the entry point, e.g. the main method. To protect both me and the container, I have to hide my own dependencies so that only some entry code is visible.

So when my entry code is called, it is aligned to the dependencies of the container. But when my entry code actually calls my functionality, I need a different classloader where only the interfaces/annotations from the container are visible (in spark, e.g. JavaRDD), and everything else comes from my own dependencies.
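
A minimal sketch of such a child-first classloader. The list of shared packages is just an example and would need much more care in real life:

```java
import java.net.URL;
import java.net.URLClassLoader;

public class ChildFirstClassLoader extends URLClassLoader {

    public ChildFirstClassLoader(URL[] myDependencies, ClassLoader container) {
        super(myDependencies, container);
    }

    @Override
    protected synchronized Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        // Shared API classes must come from the container (and java.*
        // must come from the bootstrap classloader anyway), otherwise
        // instanceof checks across the boundary fail.
        if (name.startsWith("java.") || name.startsWith("org.apache.spark.")) {
            return super.loadClass(name, resolve);
        }
        Class<?> loaded = findLoadedClass(name);
        if (loaded == null) {
            try {
                loaded = findClass(name);              // my jars first
            } catch (ClassNotFoundException e) {
                return super.loadClass(name, resolve); // container as fallback
            }
        }
        if (resolve) {
            resolveClass(loaded);
        }
        return loaded;
    }
}
```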

What I need to submit to spark and storm are actually two archives then.
One contains the entry code that is called by spark/storm, plus a little classloader magic that I own and that has no overlap with the spark/storm dependencies, or an existing classloader-magic library where I know there are no overlaps.
The other archive contains my own dependencies, handled by the classloader magic, with its own entry point. The code in there is compiled against the spark/storm API, but the API itself is not included; it is provided by the container.

Maybe one can reuse web archives (WAR), plus a little classloader-magic library that provides the bridge: it looks into the WAR, calls the entry code, and provides classloader isolation like in the good old servlet world.

Or a fat jar that contains the entry code, plus the classloader-magic library, plus the jar that contains my dependencies.
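
A sketch of the entry code for the fat-jar variant. The resource name and the job class are made up, and ChildFirstClassLoader is the sketch from above:

```java
import java.io.File;
import java.io.InputStream;
import java.lang.reflect.Method;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class IsolatedEntryPoint {

    public static void main(String[] args) throws Exception {
        // Copy the nested jar (shipped as a resource inside the fat
        // jar) to a temp file, so URLClassLoader can read it.
        File innerJar = File.createTempFile("dependencies", ".jar");
        innerJar.deleteOnExit();
        try (InputStream in = IsolatedEntryPoint.class
                .getResourceAsStream("/dependencies.jar")) {
            Files.copy(in, innerJar.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }

        ClassLoader isolated = new ChildFirstClassLoader(
                new URL[] { innerJar.toURI().toURL() },
                IsolatedEntryPoint.class.getClassLoader());

        // From here on, everything runs against my own dependencies;
        // only the shared API comes from the container.
        Thread.currentThread().setContextClassLoader(isolated);
        Class<?> realMain = isolated.loadClass("my.job.RealMain");
        Method main = realMain.getMethod("main", String[].class);
        main.invoke(null, (Object) args);
    }
}
```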

Does that make sense?

Friday, February 20, 2015

Thoughts on spark, storm, and classpath isolation

Currently I use apache-spark and apache-storm, and in both projects I always end up in a library mess. They depend on libraries that I use in my code as well, but mostly in a different version. For example, I use Google's latest Guava. Everyone else in the Java world uses Guava too, but most likely in a different version. And Guava constantly adds, deprecates, and then removes functionality. So it happens that when I deploy my code to storm or spark, I might break code, on my side or on their side, if I am not careful.

I do not want to go into further detail; it just causes a lot of pain. The best way to stay out of trouble is to stick to the versions that storm and spark use, and to avoid the latest versions. :-(

I originally come from the servlet-container world, where this problem is solved. A servlet container runs web applications, and the dependencies of both the servlet container and the web application are isolated from each other. The servlet container only makes those dependencies visible to the web application that are required by the servlet specification, e.g. the servlet API.

In the future I would assume that spark and storm go a similar way: they run user code with a custom classloader, and only the spark or storm API is visible to the user code to program against.

By the way, a simple workaround at the moment is to change the classpath order manually. As far as I know, spark has an option to say explicitly: put the user dependencies first on the classpath. And in storm, one can rewrite the shell script to put the user dependencies first.
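
In spark that is a configuration option. The flag was experimental and its exact name depends on the version, so the one below is just an example:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class UserClassPathFirst {
    public static void main(String[] args) {
        // Experimental flag around this time; the exact name differs
        // between versions (e.g. spark.executor.userClassPathFirst
        // in later releases).
        SparkConf conf = new SparkConf()
                .setAppName("my-job")
                .set("spark.files.userClassPathFirst", "true");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code ...
        sc.stop();
    }
}
```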

But those solutions have a drawback: they might break storm's or spark's own dependencies.
They also assume that the classloader searches for classes in a defined order, from left to right in the classpath. But I think that assumption is not guaranteed. There might be a classloader that scans the classpath in parallel. In that case the mess might become even worse if there are two versions of Guava in the classpath.

(This topic actually touches something that I have been interested in for quite some time. The virtual and definitely the real world are constantly changing, and with them the libraries and interfaces that we use. One can barely rely on stability. And one would have to constantly recompile the world against old and new interfaces and so on, to see if progress is possible without breaking anything.
This tickles my mind a lot. On the one hand I love to see order, stability, predictability, reproducibility. I like to conserve. On the other hand, that is a very static view of the virtual world, where we tear things down all the time, replace them, or build them differently.)