Sunday, April 12, 2015

Another rant: how to isolate Spark/Storm dependencies from my own dependencies

In my last rant I talked about classloaders and about protecting the dependencies used by my code from overlapping with the dependencies used by the container that runs my code. The only contact between the two should be through interfaces and annotations visible to my code.

In this rant I write about functions and predicates, which again leads to classloading issues.

I have a huge amount of geographical data that I want to reduce to the data for a single country, and I want to do it in Spark or Storm.

  • I do not know in advance which country the data will be filtered by. It is a parameter.
  • I cannot ask a remote service for each data point which country it is in, as that is too slow.
Normally I would create a package that contains the shapes for all countries. A service then takes a country as a parameter and returns a stateless, immutable, thread-safe predicate that can be applied to any coordinate in my data.
  • each shape maps to one country,
  • I need a library that uncompresses the shape map,
  • I need a library that loads the shapes, creates an index, and offers a function that returns the matching shape for a coordinate and, implicitly, the country with it.
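
To make the idea of "country in, predicate out" concrete, here is a minimal sketch. It assumes one simple polygon per country (no holes, no index) and uses a plain ray-casting test instead of a real geo library. Coordinate, CountryShapes and predicateFor are hypothetical names made up for the illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical value type for one data point. */
final class Coordinate {
    final double lon;
    final double lat;

    Coordinate(double lon, double lat) {
        this.lon = lon;
        this.lat = lat;
    }
}

/** Hypothetical service: one simple polygon per country, no holes, no index. */
final class CountryShapes {
    // country code -> { longitudes, latitudes } of the polygon corners
    private final Map<String, double[][]> rings = new HashMap<>();

    CountryShapes add(String country, double[] lons, double[] lats) {
        rings.put(country, new double[][] { lons.clone(), lats.clone() });
        return this;
    }

    /** Returns a predicate that only reads its own private copies of the corner arrays. */
    Predicate<Coordinate> predicateFor(String country) {
        double[][] ring = rings.get(country);
        if (ring == null) {
            throw new IllegalArgumentException("unknown country: " + country);
        }
        final double[] xs = ring[0].clone();
        final double[] ys = ring[1].clone();
        return c -> contains(xs, ys, c.lon, c.lat);
    }

    /** Ray casting: count how often a horizontal ray from the point crosses an edge. */
    private static boolean contains(double[] xs, double[] ys, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = xs.length - 1; i < xs.length; j = i++) {
            boolean edgeSpansY = (ys[i] > y) != (ys[j] > y);
            if (edgeSpansY && x < (xs[j] - xs[i]) * (y - ys[i]) / (ys[j] - ys[i]) + xs[i]) {
                inside = !inside;
            }
        }
        return inside;
    }
}
```

The returned predicate captures nothing but two arrays that are never written to again, so it can be shared between threads. A real shape library would replace the contains method.
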
The shape map is rather big. Even heavily compressed it is almost 10 MB. It has to be distributed to every node that runs the job, and there it has to be unpacked, prepared and so on.

To reduce the size, I can offer a remote service that prepares the shape map so that only the shapes needed for the particular country are returned. This service is called once before my job runs, and its result can then be safely distributed with the job.

I can also unpack everything in the driver, which then distributes the unpacked data, so the nodes do not have to do that work.

For uncompressing I can use the Zip support that ships with the JDK (java.util.zip).
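
A minimal sketch of the unpacking step, using only java.util.zip. It assumes the shape map is a plain zip archive with one entry per shape. ShapeArchive and unpack are hypothetical names, and the real archive layout is not known here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper that reads every file entry of a zip stream into memory. */
final class ShapeArchive {

    /** Returns entry name -> uncompressed bytes, in archive order. */
    static Map<String, byte[]> unpack(InputStream in) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                entries.put(entry.getName(), out.toByteArray());
            }
        }
        return entries;
    }
}
```

Since this needs nothing beyond the JDK, the unpacking step itself cannot clash with the container's libraries.
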

There are geo tools out there that let me create indexes and offer contains/intersects functions. Such a library has to come with my package. But the one I know does not support shipping indexes via serialization, so I assume I have to build the index on the actual node.
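
One way to live with a non-serializable index is to ship only the raw shape data and rebuild the index lazily on each node. The following is a sketch of that pattern under stated assumptions: NodeLocalIndex is a made-up stand-in (just a bounding box) for the index of a real geo library, and LazyShapeFilter is a hypothetical name.

```java
import java.io.Serializable;

/** Stand-in for a library index that cannot be serialized. Here it is only a bounding box. */
final class NodeLocalIndex {
    private double minLon = Double.POSITIVE_INFINITY, maxLon = Double.NEGATIVE_INFINITY;
    private double minLat = Double.POSITIVE_INFINITY, maxLat = Double.NEGATIVE_INFINITY;

    NodeLocalIndex(double[] lons, double[] lats) {
        for (int i = 0; i < lons.length; i++) {
            minLon = Math.min(minLon, lons[i]);
            maxLon = Math.max(maxLon, lons[i]);
            minLat = Math.min(minLat, lats[i]);
            maxLat = Math.max(maxLat, lats[i]);
        }
    }

    boolean contains(double lon, double lat) {
        return lon >= minLon && lon <= maxLon && lat >= minLat && lat <= maxLat;
    }
}

/** Serializable filter: the raw corners travel with the job, the index does not. */
final class LazyShapeFilter implements Serializable {
    private static final long serialVersionUID = 1L;

    private final double[] lons;
    private final double[] lats;
    // transient: skipped by Java serialization, so it is null again after deserialization
    private transient volatile NodeLocalIndex index;

    LazyShapeFilter(double[] lons, double[] lats) {
        this.lons = lons.clone();
        this.lats = lats.clone();
    }

    boolean test(double lon, double lat) {
        NodeLocalIndex local = index;
        if (local == null) {
            synchronized (this) {
                local = index;
                if (local == null) {
                    // built at most once per deserialized copy, on first use
                    local = new NodeLocalIndex(lons, lats);
                    index = local;
                }
            }
        }
        return local.contains(lon, lat);
    }

    boolean isIndexBuilt() {
        return index != null;
    }
}
```

The driver only serializes the two arrays. Each node pays the index cost on the first call to test and reuses the index afterwards.
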

This looks like a straightforward solution.

But again I have dependencies that I cannot control, and they bring in their own dependencies. For example, one depends on Guava. Storm and Spark also use Guava, but a different version. So there is a chance that code gets called that needs a specific feature of Guava. If an older version that does not have this feature yet comes first on the classpath, that code will break. The same is true the other way around, when newer versions of Guava remove deprecated functionality. I have seen this with Guava's Stopwatch.
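
To find out early which variant of a class actually won on the classpath, a small reflective probe can help. This is only a diagnostic sketch, not a fix. ClasspathProbe and hasMethod are hypothetical names, and the probe only looks at public method names, not at signatures.

```java
import java.lang.reflect.Method;

/** Hypothetical startup check: does the class visible to me offer a given public method? */
final class ClasspathProbe {

    static boolean hasMethod(String className, String methodName) {
        try {
            // initialize = false: only inspect the class, do not run its static initializers
            Class<?> type = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            for (Method method : type.getMethods()) {
                if (method.getName().equals(methodName)) {
                    return true;
                }
            }
            return false;
        } catch (ClassNotFoundException | LinkageError e) {
            // class missing or not loadable in this classpath
            return false;
        }
    }
}
```

For the Guava case, a call such as ClasspathProbe.hasMethod("com.google.common.base.Stopwatch", "createStarted") at job start would show whether the Guava version that is visible has the static factory, which older Guava versions lack.
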

I really would love to have my dependencies separated. I do not want to align my code with dependencies controlled by other people. It restricts me.

What can I do?

Usually the container puts its own dependencies and my dependencies into one classpath and then calls the entry point, e.g. the main method. To protect both me and the container, I have to hide my own dependencies so that only some entry code is visible.

So when my entry code is called, it is aligned with the dependencies of the container. But when my entry code actually calls my functionality, I need a different classloader, in which only the interfaces/annotations from the container are visible (in Spark that would be JavaRDD, for example) and everything else comes from my own dependencies.
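
A minimal sketch of such a classloader, assuming Java 9 or later (for ClassLoader.getPlatformClassLoader). IsolatingClassLoader and the notion of "shared prefixes" are hypothetical. For Spark a shared prefix might be something like "org.apache.spark.api.java.", but which packages really have to be shared is an assumption that has to be checked per container.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Arrays;
import java.util.List;

/** Loads classes from its own jars, except for an explicit list of packages shared with the container. */
final class IsolatingClassLoader extends URLClassLoader {
    private final ClassLoader container;
    private final List<String> sharedPrefixes;

    IsolatingClassLoader(URL[] ownJars, ClassLoader container, String... sharedPrefixes) {
        // Parent is the platform loader: JDK classes resolve, the container's classpath does not.
        super(ownJars, ClassLoader.getPlatformClassLoader());
        this.container = container;
        this.sharedPrefixes = Arrays.asList(sharedPrefixes.clone());
    }

    boolean isShared(String className) {
        for (String prefix : sharedPrefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> loaded = findLoadedClass(name);
            if (loaded == null) {
                loaded = isShared(name)
                        ? container.loadClass(name)      // API types must be the container's own
                        : super.loadClass(name, false);  // JDK first, then my own jars only
            }
            if (resolve) {
                resolveClass(loaded);
            }
            return loaded;
        }
    }
}
```

Anything that is neither a JDK class, nor in my jars, nor on the shared list ends in a ClassNotFoundException instead of silently picking up the container's version.
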

What I need to submit to Spark and Storm are then actually two archives.
One contains the entry code that is called by Spark/Storm, plus a little classloader magic that I own and that has no overlaps with the Spark/Storm dependencies. Alternatively, it uses existing classloader magic libraries for which I know that there are no overlaps.
The other archive contains my own dependencies, handled by the classloader magic, with its own entry point. The code in there is compiled against the Spark/Storm API, but the API itself is not contained. It is provided by the container.
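
The hand-over from the outer entry code to the inner entry point could look roughly like the sketch below. It assumes the inner entry point implements java.util.concurrent.Callable, a JDK interface that both sides see as the same type. EntryBridge and SampleJob are hypothetical names, and the classloader passed in would be whatever isolating loader wraps the second archive.

```java
import java.util.concurrent.Callable;

/** Hypothetical bridge living in the first archive: it only knows a class name and a classloader. */
final class EntryBridge {

    static Object call(ClassLoader isolated, String entryClassName) throws Exception {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        // Libraries that look up classes via the context classloader should see my world, too.
        current.setContextClassLoader(isolated);
        try {
            Class<?> entryClass = Class.forName(entryClassName, true, isolated);
            Object entry = entryClass.getDeclaredConstructor().newInstance();
            return ((Callable<?>) entry).call();
        } finally {
            // Always hand the thread back to the container in its original state.
            current.setContextClassLoader(previous);
        }
    }
}

/** Hypothetical inner entry point. In a real setup it would live in the second archive. */
final class SampleJob implements Callable<String> {
    public SampleJob() {
    }

    @Override
    public String call() {
        return "job ran";
    }
}
```

The bridge never references a class from the second archive at compile time, so the first archive stays free of my dependencies.
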

Maybe one can reuse web archives (WAR), together with a little classloader magic library that provides the bridge to look into the WAR, calls the entry code, and also provides classloader isolation like in the good old servlet world.

Or a fat jar that contains the entry code, the classloader magic library, and the jar that contains my dependencies.

Does that make sense?
