Sunday, April 12, 2015

Another rant: how to isolate Spark/Storm dependencies from my own dependencies

In my last rant I talked about classloaders and about protecting the dependencies used by my code from overlapping with the dependencies used by the container that runs my code. The only contact between the two should be through the interfaces and annotations that the container makes visible to my code.

In this rant I write about functions and predicates, which again leads to classloading issues.

I have a huge amount of geographical data that I want to reduce to the data for a single country. And I want to do it in Spark or Storm.

  • I do not know in advance which country the data will be filtered by. It is a parameter.
  • I cannot ask a remote service for each data point which country it is in, as that is too slow.
Normally I would create a package that contains the shapes for all countries, plus a service that takes a country as a parameter and returns a stateless, immutable, thread-safe predicate that can be applied to any coordinate from my data.
  • each shape maps to one country,
  • I need a library that uncompresses the shape map,
  • I need a library that loads the shapes, creates an index, and offers a function that returns the matching shape for given coordinates and, implicitly with it, the country.
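
The predicate service described above could look roughly like the following sketch. It is a simplification under loud assumptions: the names CountryFilters, Coordinate, addShape and forCountry are made up for illustration, each country is a single simple polygon, and the JDK's java.awt.geom.Path2D stands in for a real geo library with a spatial index.

```java
import java.awt.geom.Path2D;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: one simple polygon per country, looked up by country code. */
final class CountryFilters {

    /** A coordinate in degrees; immutable, so it can be shared between threads. */
    static final class Coordinate {
        final double lon;
        final double lat;

        Coordinate(double lon, double lat) {
            this.lon = lon;
            this.lat = lat;
        }
    }

    private final Map<String, Path2D.Double> shapes = new HashMap<>();

    /** Registers a closed polygon, given as {lon, lat} pairs, for a country. */
    void addShape(String country, double[][] ring) {
        Path2D.Double path = new Path2D.Double();
        path.moveTo(ring[0][0], ring[0][1]);
        for (int i = 1; i < ring.length; i++) {
            path.lineTo(ring[i][0], ring[i][1]);
        }
        path.closePath();
        shapes.put(country, path);
    }

    /** Returns a predicate that only ever reads its own private copy of the shape. */
    Predicate<Coordinate> forCountry(String country) {
        Path2D.Double shape = shapes.get(country);
        if (shape == null) {
            throw new IllegalArgumentException("unknown country: " + country);
        }
        Path2D.Double copy = new Path2D.Double(shape);
        return c -> copy.contains(c.lon, c.lat);
    }
}
```

A job would ask for the predicate once, e.g. filters.forCountry("DE"), and then apply it to every data point. Real country outlines consist of many polygons with holes, which is exactly why a proper geo library with an index is needed instead of this toy.
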
The shape map is rather big. Even heavily compressed it is almost 10 MB. It has to be distributed to every node that runs the job, and there it has to be unpacked, prepared and so on.

To reduce the size, I can offer a remote service that prepares the shape map so that only the shapes needed for the particular country are returned. This service is called once before my job runs, and its result can then be safely distributed with the job.

I can also unpack everything in the driver, which then distributes the unpacked data, so the nodes are spared that work.

For uncompressing I can use the Zip support (java.util.zip) that ships with the JDK.
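
A minimal sketch of that unpacking step, using only java.util.zip. The class name ShapeArchive and the idea of keeping every entry in memory are assumptions for illustration; a 10 MB archive may well unpack to something that is better streamed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper that unpacks a zipped shape map without third-party libraries. */
final class ShapeArchive {

    /** Reads every file entry of a zip stream into memory, keyed by entry name. */
    static Map<String, byte[]> unpack(InputStream in) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                entries.put(entry.getName(), out.toByteArray());
            }
        }
        return entries;
    }
}
```

Because this touches nothing outside the JDK, it cannot collide with any version of any library that the container brings along.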

There are geo tools out there that let me create indexes and offer contains/intersects functions. Such a library has to come with my package. But the one I know does not support shipping indexes via serialization, so I assume I have to build the index on the actual node.

This looks like a straightforward solution.

But again I have dependencies that I cannot control, and they bring in their own dependencies. For example, one of them depends on Guava. Storm and Spark also use Guava, but in a different version. So there is a chance that code gets called which needs a specific Guava feature, and if an older Guava version that does not yet implement this feature comes first in the classpath, that code will break. The same is true the other way around, when newer versions of Guava remove deprecated functionality. I have seen this with the Stopwatch class in Guava.
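
To see whether such a clash is lurking, one can ask a classloader for every copy of a class file it can reach. The following is only a diagnostic sketch; the class name ClasspathCheck and its methods are made up, and the Stopwatch default is just the example from above.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/** Hypothetical diagnostic: lists all locations that offer a given class. */
final class ClasspathCheck {

    /** Converts a class name to the resource path a classloader looks up. */
    static String resourceName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns every location from which the loader could serve the class file. */
    static List<URL> findCopies(ClassLoader loader, String className) throws IOException {
        return Collections.list(loader.getResources(resourceName(className)));
    }

    public static void main(String[] args) throws IOException {
        String name = args.length > 0 ? args[0] : "com.google.common.base.Stopwatch";
        List<URL> copies = findCopies(Thread.currentThread().getContextClassLoader(), name);
        // More than one line of output means two jars compete for the same class.
        for (URL copy : copies) {
            System.out.println(copy);
        }
    }
}
```

Run inside the container, more than one printed location for a Guava class is a hint that two versions are competing for the same name.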

I would really love to have my dependencies separated. I do not want to align my code with dependencies that are controlled by other people. It restricts me.

What can I do?

Usually the container puts its own dependencies and my dependencies into a single classpath and then calls the entry point, e.g. the main method. To protect both me and the container, I have to hide my own dependencies so that only some entry code is visible.

So when my entry code is called, it is aligned with the dependencies of the container. But when my entry code actually calls my functionality, I need a different classloader, in which only the interfaces/annotations from the container are visible (in Spark that would be JavaRDD, for example) and everything else comes from my own dependencies.
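
Such a classloader could be sketched as follows. Everything here is an assumption for illustration: the name IsolatingClassLoader, the idea of passing the shared API packages as prefixes (for Spark something like "org.apache.spark." and probably "scala."), and the strict rule that nothing else is ever taken from the container. JDK packages outside java.* would have to be listed as shared prefixes as well.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Objects;

/** Hypothetical loader: shared API from the container, everything else from my own jars. */
final class IsolatingClassLoader extends URLClassLoader {

    private final String[] sharedPrefixes;

    IsolatingClassLoader(URL[] ownJars, ClassLoader container, String... sharedPrefixes) {
        super(ownJars, Objects.requireNonNull(container, "container"));
        this.sharedPrefixes = sharedPrefixes.clone();
    }

    /** java.* must always come from the JDK; the rest is shared only if listed. */
    private boolean isShared(String name) {
        if (name.startsWith("java.")) {
            return true;
        }
        for (String prefix : sharedPrefixes) {
            if (name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                // Shared names go to the container, all others only to my own jars.
                c = isShared(name) ? getParent().loadClass(name) : findClass(name);
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

The important design choice is that there is no fallback to the container for non-shared names: if my archive forgets to ship Guava, loading fails immediately instead of silently picking up the container's version.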

What I need to submit to Spark and Storm, then, are actually two archives.
One contains the entry code that is called by Spark/Storm, plus a little classloader magic that I own and that has no overlap with Spark/Storm dependencies (or an existing classloader magic library where I know that there is no overlap).
The other archive contains my own code and dependencies, loaded by the classloader magic, with its own entry point. The code in there is compiled against the Spark/Storm API, but the API itself is not contained; it is provided by the container.
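
The entry code in the first archive can stay tiny. The sketch below assumes that the real entry point in the second archive implements Runnable (an interface both sides see because it comes from the JDK) and has a no-argument constructor; JobBridge, launch and DemoJob are hypothetical names, and the isolated loader is whatever classloader sees only my archive plus the container API.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical bridge: the only code of mine that the container loads directly. */
final class JobBridge {

    /** Loads the real entry class through the isolated loader and runs it. */
    static void launch(ClassLoader isolated, String entryClassName)
            throws ReflectiveOperationException {
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        // Libraries that consult the context classloader should see my world, too.
        thread.setContextClassLoader(isolated);
        try {
            Class<? extends Runnable> entry =
                    Class.forName(entryClassName, true, isolated).asSubclass(Runnable.class);
            entry.getDeclaredConstructor().newInstance().run();
        } finally {
            thread.setContextClassLoader(previous);
        }
    }

    /** Stand-in for the real entry point that would live in the second archive. */
    static final class DemoJob implements Runnable {
        static final AtomicInteger runs = new AtomicInteger();

        @Override
        public void run() {
            runs.incrementAndGet();
        }
    }
}
```

In a real setup the entry class would have to be public with a public constructor, since it is loaded by a different classloader than the bridge.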

Maybe one can reuse web archives (WAR), together with a little classloader magic library that provides the bridge: it looks into the WAR, calls the entry code, and provides classloader isolation like in the good old servlet world.

Or a fat jar that contains the entry code, the classloader magic library, and the jar that contains my dependencies.

Does that make sense?

Friday, February 20, 2015

Thoughts on Spark, Storm, and classpath isolation

Currently I use Apache Spark and Apache Storm. And in both projects I always end up with a library mess. They depend on libraries that I use in my code as well, but mostly in a newer version. For example, I use Google's latest Guava. Everyone else in the Java world uses Guava too, but most likely in a different version. And Guava constantly adds, deprecates and then removes functionality. So when I deploy my code to Storm or Spark, I might break code, on my side or on their side, if I am not careful.

I do not want to go into further detail. It just causes a lot of pain. And the best way to stay out of trouble is to stick to the versions that Storm and Spark use, and to avoid the latest versions. :-(

I originally come from the servlet container world, where this problem is solved. A servlet container runs web applications, and the dependencies of the servlet container and of the web application are isolated from each other. The servlet container makes only those dependencies visible to the web application that are required by the servlet specification, e.g. the servlet API.

I would assume that in the future Spark and Storm will go a similar way: they run user code with a custom classloader, and only the Spark or Storm API is visible to the user code to program against.

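By the way, a simple workaround at the moment is to change the classpath manually. As far as I know, Spark has an option to say explicitly: put user dependencies first on the classpath (the experimental spark.files.userClassPathFirst setting, replaced in Spark 1.3 by spark.driver.userClassPathFirst and spark.executor.userClassPathFirst). And in Storm, one can rewrite the launch script to put the user dependencies first.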

But those solutions have a drawback: they might break Storm's or Spark's own dependencies.
They also assume that the classloader finds classes in a defined order, from left to right in the classpath. The standard URLClassLoader does search in that order, but I do not think the assumption holds in general: a custom classloader could scan the classpath differently, for example in parallel. In that case the mess might become even worse if there are two versions of Guava in the classpath.

(This topic actually touches on something that I have been interested in for quite some time. The virtual world, and definitely the real world, is constantly changing, and with it the libraries and interfaces that we use. One can barely rely on stability. One would have to constantly recompile the world against old and new interfaces to see whether progress is possible without breaking anything.
This tickles my mind a lot. On the one hand I love order, stability, predictability, reproducibility. I like to conserve. On the other hand, that is a very static view of the virtual world, where we tear things down all the time, replace them or build them differently.)

Saturday, December 7, 2013

It's always the others, never me

Man, I am such a bore. But at least not as lame as the people who only grumble that those at the top are all criminals and to blame for everything anyway. Who squat in their well-powdered comfort bubble. Who have it warm. The energy that I burn getting worked up about it, ah, if I could still put that to sensible use, I'd be a productivity monster and not such a snoring fish.

Are there actually snoring fish? Bubbles on the surface of the fish tank?

Quickly powder my bottom as well.

Monday, October 21, 2013

Why articles about Facebook make you sick

Just read: "Die Bilder und die Leere (Warum Facebook unglücklich macht)" (The images and the emptiness: why Facebook makes you unhappy), on a website that aggressively drives advertising into my eyes with overlay and hockey stick. In a roundabout way, with the transparent user and all that, so that the filthy lucre keeps rolling. Whatever.

The article provokes me. I agree, Facebook presses hard into the wound of my own loneliness. I see people travelling around and taking photos all year. Their families growing. Them going to clubs and bars, throwing parties, and so on, and so on. Yes, I paint myself a picture of how nice they have it. And I am not part of it.

Conversely, I see my own photos, in which I shyly put my arm around the waist of a dear friend, in front of the oversized engines of a space shuttle. Or photos of a landscape in Montenegro, where I deliberately push the car to its limit at far too high a speed, and right next to it is the abyss.

What I liked about the books by Philippe Djian is the tremendous life in them. Stirred up, full of energy, angry, fist clenched, I would then sometimes leave the house. Now I'll show the world. And in the evening I just eat a tasty slice of bread and butter again after all. With a wiener on the side.

In stories by Sibylle Berg I often saw how the daydreams (how lovely it would all be, oh, if only I had him or her all to myself, a lonely house out in the green, by the lake, with a forest attached, bathing naked in the lake at night, the moon casting romantic shadows, sigh) get cruelly torn to pieces by reality.

I'm not stupid. It is still one world. That is not just Facebook. Every day I come back down to the fucking ground of reality. And get torn out of my daydreams. I have to go shopping. Clean the toilet. My wanking hand hurts. People die. Others don't write me an email either way. Quarrels. The money isn't enough for the addictive substances. I feel lonely with Facebook, too.

Why shouldn't it dawn on me that the other bon vivants don't live in a rose-coloured world either?

Some Pink Floyd song had the line "All we need to do is make sure we keep talking". And I find it completely irrelevant how I do it. Whether via books, letters, Facebook, Twitter-flutter, or pee circles in the snow.

This constant "it makes you sick" harasses me with its "healthy" ideal. Yet there have always been people who overdo something, others who underdo it, and some, well, who can't do it with anyone at all.

(And so, dear NZZ, since you run advertising and want to deliver it in a targeted way, you can surely be accused of something as well: that such articles deliberately intimidate me, so that I now read more on your site again in order to belong to the normal ones. And that, through your tracking, I then become the transparent and thus ad-relevant citizen, for whom you get paid per click.)

Friday, September 4, 2009

My command top 10

1974 cd
1709 ls
1319 ssh
565 grep
393 less
364 ping
315 cvs
286 vim
206 ./
196 rm

With 10000 entries in the history file. Well. I am surprised how often "ping" shows up. Not a good sign. :-)