Sep
Why Cloud Computing Is Here To Stay
by Janico in technical
Occasionally, you hear the argument that cloud computing is the latest incarnation of a trend that comes and goes throughout the history of computing. It used to be called time-sharing in the early days, and client-server during the ’80s and ’90s. Depending on the relative cost of networking versus processing, computing tended to be either centralized or decentralized. The reasoning is that we are currently in a phase where processing is cheaper than networking, so that cloud computing is economically reasonable. As a consequence, another shift in these numbers might bring back a tendency towards decentralization.
There is a flaw in this logic, however: networking and computing are not the only factors in this equation. The costs of operating an IT system must be considered as well. And while networking and computing became cheaper and cheaper over time, the operational costs are staying the same or even rising, because the systems are becoming increasingly complex.

(I came across this insight in a talk by security guru Bruce Schneier.)
Aug
Identifying Names with OpenNLP and JRuby
by Janico in programming
When trying to make sense of text, nouns are a good place to start, as they usually convey the subject of the text. To be more precise, when analyzing a text to find out what it’s about, we want to identify, among other things, people, organizations, and locations, also known as named entities.
In this post, I take a look at how to find named entities using OpenNLP and JRuby. OpenNLP is a collection of tools for natural language processing written in Java. In addition to named entity identification, OpenNLP includes tools for sentence detection, part-of-speech-tagging, and more. JRuby makes it possible to use Java libraries from Ruby code.
When the names have been identified, I want to tag them using an ad-hoc HTML
microformat where each name gets
enclosed in a <span> tag that has a CSS class corresponding to the type of
named entity. For example, the text “Mrs. Schmidt drives to Copenhagen on
Monday.” should be processed into “Mrs. <span class="person">Schmidt</span>
drives to <span class="location">Copenhagen</span> on <span class="date">Monday</span>.”
Dependencies
In order to follow along with the examples here, you need a working JRuby installation and the following libraries and data files.
Libraries
- OpenNLP Tools: The basic tools which include the named entity detection;
- OpenNLP Maxent: a machine learning package required for loading the language models;
- GNU Trove: a library of java collections that are needed by OpenNLP.
Model Data
The following models are used by OpenNLP to identify sentences and named entities.
Getting Started
First, we need to load all the dependencies into our application or interactive Ruby session.
require 'java'
require File.join(File.dirname(__FILE__), '../deps/opennlp-tools-1.4.3.jar')
require File.join(File.dirname(__FILE__), '../deps/maxent-2.5.2.jar')
require File.join(File.dirname(__FILE__), '../deps/trove-2.1.0.jar')
java_import Java::opennlp.tools.namefind.NameFinderME
java_import Java::opennlp.tools.tokenize.SimpleTokenizer
java_import Java::opennlp.maxent.io.BinaryGISModelReader
java_import Java::opennlp.tools.lang.english.SentenceDetector
java_import Java::opennlp.tools.util.Span
By requiring the ‘java’ package, we tell JRuby to make the Java
interoperability available. In JRuby, require lets us load jar files in addition to its
primary function of loading Ruby code. When we want to use Java classes
directly in our current namespace, we can use the java_import method.
Next, I initialize the objects that will do the work.
@tokenizer = SimpleTokenizer.new
@detector = SentenceDetector.new(File.expand_path("models/EnglishSD.bin.gz"))
@finders = %w{person location date organization}.map do |model|
  [model, NameFinderME.new(BinaryGISModelReader.new(java.io.File.new("models/#{model}.bin.gz")).getModel)]
end
The tokenizer is needed to split the text into words, which are passed to the entity identifier. The sentence detector needs to load a model for identifying a sentence in a given language. This is necessary because it needs to be aware of things like abbreviations that have periods after them; otherwise it would take “Mrs.” in the example above as a sentence on its own. The finders are responsible for the actual identification. We need one of these for each of the models to apply. Here, I want to identify persons, locations, dates, and organizations. Loading the models is relatively expensive, so you will probably want to do that only once in a real application and hold on to the created objects.
Now that we are done with initializations, we can finally begin to do some text processing.
sentences = @detector.sentDetect(text)
sentence_positions = [0] + @detector.sentPosDetect(text)
Here, I split the text into sentences. It is generally recommended to use the
entity identification on sentences rather than full documents, as it avoids
mistakenly detecting names across sentence boundaries and also improves the
accuracy. In addition to finding the sentences as a list of strings, I need a
list of the positions where the sentences start so that I can insert the tags
into the text later. I prepend position 0 to the list, because sentPosDetect
only returns the positions where a sentence is finished, but I also need the
beginning of the first sentence.
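To make the role of the prepended 0 concrete, here is a minimal plain-Ruby sketch that does not call OpenNLP at all. The `boundaries` array is a hand-computed stand-in for what `sentPosDetect` is described as returning above; its exact contents for a real text are an assumption.

```ruby
text = "Mrs. Schmidt drives to Copenhagen on Monday. She returns on Friday."

# Assumed stand-in for @detector.sentPosDetect(text): one entry per
# boundary between two sentences. Computed by hand, not by OpenNLP.
boundaries = [text.index("She")]

# Prepending 0 gives us the start offset of every sentence, including the first.
sentence_positions = [0] + boundaries

# Pair each start with the next start (or the end of the text) to slice
# the sentences back out of the original string.
limits = sentence_positions + [text.length]
sentences = limits.each_cons(2).map { |from, to| text[from...to] }
```

With the start offsets in hand, any position found inside a sentence can be shifted back into the coordinates of the full text.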
annotations = []
sentences.each_with_index do |sentence, sentence_idx|
  token_positions = @tokenizer.tokenizePos(sentence)
  tokens = Span.spansToStrings(token_positions, sentence)
  @finders.each do |model, finder|
    names = finder.find(tokens)
    annotations += names.map do |name|
      start_pos = token_positions[name.getStart].getStart + sentence_positions[sentence_idx]
      end_pos = token_positions[name.getEnd-1].getEnd + sentence_positions[sentence_idx]
      Struct::Annotation.new(model, Range.new(start_pos, end_pos))
    end
  end
end
I iterate over all sentences, breaking each sentence into tokens. Again, I need
the tokens as strings and also their positions for inserting the tags. The
search for named entities is done by applying each finder, which provides me with a
list of names. These names are Span objects which give the indexes of the
tokens where a detected entity starts and ends. The results are gathered in
the annotations list, where I put Annotation structures. Each annotation
contains the name of the model that was matched and the range of character
positions that the entity occupies in the original text, calculated from the
sentence and token offsets.
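The offset arithmetic can be tried out without OpenNLP. The following is only a sketch: `token_spans` is a hypothetical regex-based stand-in for the tokenizer, the name span is hard-coded instead of coming from a finder, and the `Annotation` struct with its `type` and `range` fields is an assumed shape. Unlike the code above, the sketch uses an end-exclusive range so that the range can be used directly to slice the text.

```ruby
# Assumed shape of an annotation: the model name plus a character range.
Annotation = Struct.new(:type, :range)

# Hypothetical tokenizer stand-in: returns [start, end) character offsets
# of every word or punctuation token, relative to the sentence.
def token_spans(sentence)
  spans = []
  sentence.scan(/\w+|[^\w\s]/) do
    match = Regexp.last_match
    spans << [match.begin(0), match.end(0)]
  end
  spans
end

text = "She left. Mrs. Schmidt drives to Copenhagen."
sentence_start = text.index("Mrs.")   # offset of the second sentence in the text
sentence = text[sentence_start..-1]
spans = token_spans(sentence)         # tokens: Mrs . Schmidt drives to Copenhagen .

# Suppose a finder reported a name covering tokens [2, 3), i.e. "Schmidt".
name_start, name_end = 2, 3

# Token offsets are relative to the sentence, so add the sentence offset.
start_pos = spans[name_start][0] + sentence_start
end_pos = spans[name_end - 1][1] + sentence_start
annotation = Annotation.new("person", start_pos...end_pos)
```

Slicing the original text with `annotation.range` then yields the name itself, which is a quick way to check that both offsets were added correctly.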
When all the annotations have been gathered, we can apply the tags to the original text.
@formatter.apply_annotations(text, annotations)
The formatter applies the annotations by putting <span> tags around the
named entities.
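The formatter itself is not shown here, so the following is only one possible way such a step could work, not the library's actual code. It assumes annotations with a `type` and an end-exclusive character `range`, assumes that annotations do not overlap, and does no HTML escaping.

```ruby
# Assumed shape of an annotation: the model name plus a character range.
Annotation = Struct.new(:type, :range)

# Wraps every annotated range of the text in a <span> tag whose class is the
# annotation type. Works from the end of the text backwards so that
# inserting tags never shifts the offsets still to be processed.
def apply_annotations(text, annotations)
  result = text.dup
  annotations.sort_by { |a| a.range.begin }.reverse_each do |a|
    result[a.range] = %(<span class="#{a.type}">#{text[a.range]}</span>)
  end
  result
end

text = "Mrs. Schmidt drives to Copenhagen on Monday."

# Helper for the example only: the end-exclusive range a word occupies in the text.
range_of = lambda do |word|
  start = text.index(word)
  start...(start + word.length)
end

annotations = [
  Annotation.new("location", range_of.call("Copenhagen")),
  Annotation.new("person", range_of.call("Schmidt")),
  Annotation.new("date", range_of.call("Monday"))
]
tagged = apply_annotations(text, annotations)
```

Processing the annotations back to front is the main design choice: it avoids having to recompute positions after each inserted tag.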
I put this code into a small library that you can find on GitHub.
Aug
Web vs. Desktop Applications
by Janico in programming
The web seems to be winning against desktop applications. Not everybody likes this, of course, and it’s always hard to predict the future, but it seems unlikely that the local computer (or should I say device?) will be where the action is in the future.
Some declare the game to be already over, but that view is not shared by everyone. The future is neither easy to predict, nor is it evenly distributed. Here is an overview of the different opinions towards this topic.
Direct Opposition
Microsoft engineer Michael Braude argues that he will never be a “web guy”, because he perceives web programming as fundamentally inferior to “real” programming on the desktop. This technical inferiority attracts inferior programmers who he does not want to work with. He even goes so far as to say that web programming is making us stupid.
Michael seems to ignore that web programming not only consists of the front-end in the browser, but that there is also a back-end where you can do most of that stuff that he finds interesting: writing to disk, spawning threads, using strong typing, and so on. Apart from that, he sees two main problems in developing web front-ends: browser incompatibilities and lack of tools.
If you are only working in a walled garden, you don’t have to deal with incompatibilities. But building true cross-platform desktop applications is not exactly easy. (By “true” cross-platform I mean apps that really integrate into the look-and-feel of the OS and not some Java Swing stuff.) Also, the compatibility situation is improving. The vendors (maybe except Michael’s employer) seem to be really working on good standards support.
The tool support for web development might not be as seamless as for some desktop environments, but it’s getting better. Firebug, for example, is an awesome tool for debugging both CSS and JavaScript.
Neo-Desktopism
Another theory of what the future may look like is Neo-Desktopism (although that term doesn’t seem to have caught on).
Neo-Desktopism is the belief that the web browser as an end user facing application platform is ultimately an evolutionary cul-de-sac. The goal of Neo-Desktopism is to evolve traditional desktop application technologies [...] to a point where they can float free of a physical local client installation, deploying on demand just like web pages.
This seems to happen more on mobile devices than on desktops. Sure, there are technologies like Java Web Start, Adobe AIR, or Microsoft Silverlight, but do they actually matter? Even Microsoft’s own Office Web Apps are apparently built with HTML and JavaScript, with Silverlight only providing some optimizations.
Don’t underestimate the web
Many of the criticisms of web-based applications argue that you cannot provide the same usability as on the desktop. In his keynote at Google I/O, Vic Gundotra tells an interesting anecdote about his time at Microsoft. They thought that web apps could never rival desktop apps and used map software as their standard example for usability that could not be reproduced inside a browser. Then along came Google Maps.
Jeff Atwood tells a similar story — incidentally also about maps — where he noticed that his wife used Google Maps instead of Streets and Trips that they had used for a long time. The surprising reason was better performance. Jeff goes on to note that Google Maps does not only provide the same functionality, it is even easier to use. He concludes that desktop applications are dead, and that “all the innovation in user interface seems to be taking place on the web, and desktop applications just aren’t keeping up.”
Indeed, there is some innovation going on in web-based UIs. The demo of Wave is an impressive example of what is possible with HTML5. Bespin is an experiment from Mozilla Labs trying to create a web-based code editor. While this is far from complete, you can already use it to play around and see what is actually possible with modern browsers.
But the advances in usability do not only come from more powerful browsers and web applications emulating the features of desktop apps. There is also the trend of embracing the constraints of the environment to create software that is easier to use than the competition from the desktop. 37signals is a company that championed this approach with their project management software Basecamp.
Who cares about applications, anyway?
The debate of web vs. desktop, however, focuses on how the things we know as applications today will be done in the future: editors, spreadsheets, photo editing, and so on. These applications won’t go away, but more important for shaping the future are the things we never did on the desktop. We usually don’t think of Wikipedia, Twitter, Google Search, etc. as applications, although they are obviously powered by applications.
One of the key points in Tim O’Reilly’s original article about Web 2.0 was “Data is the Intel Inside”. Software is more pervasive than ever, but it gets relegated to an infrastructure role. Like plumbing, software is everywhere, but it’s not noticed unless it fails. When using a forum, for example, do you care whether it uses phpBB or YaBB, or whatever? What matters is whether there is interesting content and whether there are interesting people. This data-oriented thinking does not apply equally to everything, but it is important to keep it in mind.
There are valid reasons for concern about a future where most applications are web-based: security, privacy, and ownership of data. But nobody cares how software is written as long as it works.
As programmers, we have a certain influence on the stuff that we write, but we don’t have much influence on software that we do not write. If you are against web programming, what do you do to stop that particular future from becoming more evenly distributed?
Jul
Meta is Progress
by Janico in Uncategorized
Jeff Atwood doesn’t like meta.
Meta-work becomes a reflex, a habit, an addiction, and ultimately a replacement for real productive work. It’s something I think everyone should watch out for, whatever walk of life or career you happen to have. In fact, I’ve come up with a zingy little catch phrase to help people remind themselves, and their coworkers, how toxic this stuff can be — meta is murder.
While Jeff lists quite a few compelling examples of unproductive meta-work, I would argue that successful meta-work is the driver for leaps in productivity. In fact, I’ve come up with a zingy little catch phrase to help people remind themselves, and their coworkers, that stepping out of what they’re currently doing and thinking about how to improve it is how change happens – meta is progress.
For example, most software is the result of meta-work for its underlying problem: you could just go and do the work by hand. Instead of writing blogging tools, people could just have edited HTML and put it on their web servers. But some folks did the meta-work of coding WordPress, Blogger, Movable Type, etc. and made life easier not only for themselves, but also for me, for Jeff, and for countless others.
When you solve the meta-problem really well, the underlying problem becomes trivial (e.g. using computers for calculating) or at least becomes easier by orders of magnitude, so that you can take it to levels that seemed unrealistic before (e.g. programming in high-level programming languages).
I agree that many (possibly most) meta-discussions are fruitless. It is hard to see how debating the politics of a podcasting gear site is going to lead to significant advances in podcasting technology. Sometimes meta-work is indeed about procrastinating and not about actually changing something, but I wouldn’t condemn all things meta as toxic because of that. As with drugs, the dosage (i.e. how many people spend how much time on it) and how it is applied (what problems we are talking about) determine whether it’s a cure or a poison.
How much meta can you take?

"Drawing Hands" by M.C. Escher
Apr
Think – Read – Think Again
by Janico in research, scholarship
Conventional wisdom tells us that when starting with a new topic, we should first research what others had to say about this, and only later do our own thinking.
Six months in the lab can save you a day in the library – Albert Migliori
However, in my experience it is a lot harder to appreciate the solutions that others have found, when you didn’t try to solve the problem yourself.
The obvious counterargument is that reading the solution right away is more efficient, and you don’t have the time to do all the thinking yourself. I would argue that you can save even more time when you only read the introduction and the conclusions but not the details. If you weren’t deeply involved with the problem before, you would only remember the gist anyway. Come back for the details when you really need them and have tried to come up with your own solution.
S. Keshav suggested a similar process for reading research papers. I think this should be applied to anything with deep technical content such as book chapters or detailed blog posts.
As for the “think again” part from the title of this post… I guess that is obvious.
Apr
Who’s Afraid of Targeted Ads?
by Janico in Uncategorized
When trying to explain why it is dangerous to be careless with personal data, many people (including myself) often use the argument that evil corporations will use that data to send or show you advertisements tailored to your interests. Is that really an example of bad things that result from privacy breaches? Is it even a valid argument that you want to avoid getting personalized ads?
I know that this argument has never helped me convince anyone of the value of privacy. More likely, it has damaged my credibility, because nobody saw the problem and people concluded that my standpoint was pitifully weak.
Personalized ads are not evil per se. Ads are annoying, that’s for sure, but they are annoying because they interrupt you, not because of their contents. In fact, the ads that annoy me the most are the ones that are not tailored to my interests at all. In theory, more targeted ads should even reduce the overall volume of advertising, as targeted ads are assumed to be more effective. (As an aside, however, I like Seth Godin’s idea of permission marketing a lot better. In the long run this would be an advantage for both customers and those who want to sell something.)
Targeted ads seem creepy to privacy proponents, because they are a symptom of someone knowing more about you than you thought. Don’t get me wrong, I’m not saying that targeted advertising is a good thing. But the problem is not the ads themselves.
Privacy is a difficult subject. Here are some good arguments by Bruce Schneier and Daniel Solove. Arguing in favor of privacy is rather unpopular; we should not weaken our position even further with weak arguments.
Mar
Presentation at the Networking Colloquium in Bremen
by Janico in research
Yesterday, I gave a presentation about the status of my dissertation project at the colloquium of the networking group at the University of Bremen. These are the slides I used.
Feb
Presentation on DTN Publish/Subscribe in Dagstuhl
by Janico in research
Last week I attended the second seminar on Delay and Disruption-Tolerant Networking at Schloss Dagstuhl. Together with Dirk Kutscher I gave a presentation on our current research on Publish/Subscribe Multicasting in DTNs.
The abstract of the talk was:
We discuss the problem of controlling resource usage for multicast content distribution in DTNs. Starting from epidemic routing as a bottom-line, we evaluate different trade-offs for the key-metrics reliability (delivery ratio), immediacy (delay), and resource consumption (usage of persistent storage and links). Based on preliminary simulation results, we show what effect different choices for prioritization, filtering, and propagation of subscriptions (i.e. group membership information) have on the key-metrics.
The slides are here.
With the talk we hope to initiate a discussion on how multicast can be used effectively in resource-constrained environments.
Feb
RDTN Simulations in the Cloud
by Janico in technical
To perform simulations with RDTN more efficiently, I added support for running them on Amazon’s Cloud Computing services (called Amazon Web Services or AWS).
As the main point of the experiments I do with RDTN is to run a lot of simulations with varying parameters, parallelization is trivial: just start a couple of processes that handle different parameters.
Thanks to the ongoing trend towards utility computing, putting the simulation on a cluster is now not only feasible but even easy. Simulations are an ideal application for this style of accessing resources, where you pay only for what you use: they use as much CPU power as they can while they run, but a dedicated machine would idle most of the time, as simulations are not a continually running service.

Ingredients
The goal was to implement plumbing around RDTN and the Amazon Web Services so that you can run simulations on a given number of machines just by typing one command. The components used for this are:
- S3: The Simple Storage Service allows you to store data on Amazon’s infrastructure. I use it to store the results of the simulations.
- EC2: The Elastic Compute Cloud provides the computing resources for running the simulations. The general idea is that you have the image of the system (including the operating system and all necessary daemons and applications) you want to run and then tell EC2 to start it as a virtual machine — Amazon calls them instances — on a real machine somewhere in Amazon’s datacenters. The usage of EC2 is billed in units of machine hours — you pay only for the time your instances are running. A nice feature of EC2 is that you can create the images from a running system, so that you can start out with a basic system (I used a bare Debian GNU/Linux), start it, log in via ssh, and then configure it and install the software you need. When it works, you bundle the system into an image and store it in S3, from where it can be loaded.
- SQS: I use the Simple Queue Service to manage the simulation tasks. The variants that are to be simulated are computed in advance and stored in SQS. When the instances start they take the next variant from the queue and work on it.
- GitHub: In order to avoid updating the images whenever the code for the simulations changes, the instances pull the latest version of the awssim branch of RDTN from GitHub.
- RDTN obviously.
- AWS-S3 for Ruby: To access S3 from RDTN, I use Marcel Molina’s aws-s3, which I forked to fix some issues with Ruby 1.9 (I use 1.9 because it runs the simulator about three times faster than 1.8).
- SQS sample code for Ruby: The code I use to interact with SQS is based on sample code from Amazon’s developer resources. I also needed to tweak this to work with Ruby 1.9. Currently, the code lives in the RDTN repository, but I plan to extract it into its own repository, as it may be useful beyond the scope of RDTN.
What it does
1. From any machine, compute the variants that should be simulated and put them into SQS. Each variant is one entry in the queue.
2. Start some EC2 instances.
3. Each instance pulls the current version of RDTN from GitHub. The simulations always use the awssim branch.
4. On each instance, the simulation runner is started, getting the next variant from the queue to simulate it.
5. The results of a simulation run are written to S3.
6. The simulation runner continues to get variants from SQS (step 4) until the queue is empty; then the instance automatically shuts down.
7. The results can be pulled from S3 to any machine where the data is analyzed.
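The steps above can be sketched locally in plain Ruby. This is only an illustration of the queue-and-workers pattern, not the RDTN code: an in-memory `Queue` stands in for SQS, a `Hash` stands in for S3, threads stand in for EC2 instances, and the parameter names and `run_simulation` are hypothetical placeholders.

```ruby
# Hypothetical simulation parameters; every combination is one variant.
params = { nodes: [10, 50], buffer_size: [100, 1000] }
keys = params.keys
first, *rest = params.values
variants = first.product(*rest).map { |values| keys.zip(values).to_h }

queue = Queue.new                 # stand-in for SQS
variants.each { |variant| queue << variant }

results = {}                      # stand-in for the S3 bucket
lock = Mutex.new

# Placeholder for an actual simulation run.
def run_simulation(variant)
  variant[:nodes] * variant[:buffer_size]
end

# Returns the next variant, or nil once the queue has been drained.
def next_variant(queue)
  queue.pop(true)                 # non-blocking pop raises ThreadError when empty
rescue ThreadError
  nil
end

# Each thread plays the role of one instance: it keeps taking variants
# until none are left and then stops on its own.
workers = 2.times.map do
  Thread.new do
    while (variant = next_variant(queue))
      result = run_simulation(variant)
      lock.synchronize { results[variant] = result }
    end
  end
end
workers.each(&:join)
```

Because every worker only ever asks the queue for the next variant, adding more workers needs no coordination beyond the queue itself, which is what makes this kind of parallelization trivial.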
Future Work
Part of the processing of result data could be performed on the EC2 instances, so that only the synthesis of the results needs to be performed on a single machine.
Jan
WordPress: Backup and Upgrade to 2.7
by Janico in technical

- Wordpress 2.7 – Dashboard
It is much more concise than the old version which wasted a lot of space with a large header. You can even move around the parts on the dashboard or disable them completely.
The post writing interface also looks cleaner than before. What I like most about this is that the text entry that is the most important part of this interface (the epicenter) is now centered on the page, while in 2.6 I had to scroll down a bit in order to see it in the middle of my screen. I would have never thought how important this is, had I only looked at the page without using it. But in practice I was slightly annoyed with it whenever I wrote a post.

Wordpress 2.7 - Post


