When trying to make sense of text, nouns are a good place to start, as they
usually convey the subject of the text. To be more precise, when analyzing text
to find out what it’s about, we want to identify, among other things, people,
organizations, and locations, which are also known as named entities.
In this post, I take a look at how to find named entities using OpenNLP
and JRuby. OpenNLP is a collection of tools for natural language
processing written in Java. In addition to named entity identification, OpenNLP
includes tools for sentence detection, part-of-speech tagging, and more. JRuby
makes it possible to use Java libraries from Ruby code.
Once the names have been identified, I want to tag them using an ad-hoc HTML
Microformat where each name is enclosed in a <span> tag that has a CSS class
corresponding to the type of named entity. For example, the text “Mrs. Schmidt
drives to Copenhagen on Monday.” should be processed into “Mrs. <span class="person">Schmidt</span>
drives to <span class="location">Copenhagen</span> on <span class="date">Monday</span>.”
Dependencies
In order to follow along with the examples here, you need a working JRuby
installation and the following libraries and data files.
Libraries
- OpenNLP Tools: the basic tools, which include named entity detection;
- OpenNLP Maxent: a machine learning package required for loading the language models;
- GNU Trove: a library of Java collections needed by OpenNLP.
Model Data
The following models are used by OpenNLP to identify sentences and named
entities.
Getting Started
First, we need to load all the dependencies into our application or interactive
Ruby session.
require 'java'
require File.join(File.dirname(__FILE__), '../deps/opennlp-tools-1.4.3.jar')
require File.join(File.dirname(__FILE__), '../deps/maxent-2.5.2.jar')
require File.join(File.dirname(__FILE__), '../deps/trove-2.1.0.jar')
java_import Java::opennlp.tools.namefind.NameFinderME
java_import Java::opennlp.tools.tokenize.SimpleTokenizer
java_import Java::opennlp.maxent.io.BinaryGISModelReader
java_import Java::opennlp.tools.lang.english.SentenceDetector
java_import Java::opennlp.tools.util.Span
By requiring ‘java’, we tell JRuby to make its Java interoperability
available. In JRuby, require can load jar files in addition to its primary
function of loading Ruby code. When we want to use Java classes directly in
our current namespace, we can use the java_import method.
Next, I initialize the objects that will do the work.
@tokenizer = SimpleTokenizer.new
@detector = SentenceDetector.new(File.expand_path("models/EnglishSD.bin.gz"))
@finders = %w{person location date organization}.map do |model|
  [model, NameFinderME.new(BinaryGISModelReader.new(java.io.File.new("models/#{model}.bin.gz")).getModel)]
end
The tokenizer splits the text into words, which are then passed to the
entity identifier. The sentence detector needs to load a model for identifying
sentences in a given language. This is necessary because it has to be aware of
things like abbreviations that end with a period; otherwise it would take
“Mrs.” in the example above as a sentence on its own. The finders are
responsible for the actual identification. We need one of these for each of
the models we want to apply. Here, I want to identify persons, locations,
dates, and organizations. Loading the models is relatively expensive, so in a
real application you will probably want to do it only once and hold on to the
created objects.
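One way to hold on to the loaded objects is a small memoizing cache. The following is only a sketch: ModelCache is a hypothetical helper, not part of OpenNLP, and the block passed to it stands in for the NameFinderME construction shown above.

```ruby
# Hypothetical helper (not part of OpenNLP): builds each object once and
# returns the cached instance on every later request.
class ModelCache
  def initialize(&loader)
    @loader = loader      # block that builds the object for a model name
    @cache  = {}
    @lock   = Mutex.new   # JRuby threads run in parallel, so guard the hash
  end

  # Returns the cached object, invoking the loader on first access only.
  def fetch(name)
    @lock.synchronize { @cache[name] ||= @loader.call(name) }
  end
end

# Stand-in loader for illustration; it counts how often it is invoked.
load_count = 0
finders = ModelCache.new do |name|
  load_count += 1
  "finder for #{name}"    # in the real code: NameFinderME.new(...)
end

first  = finders.fetch("person")
second = finders.fetch("person")
```

Both calls return the same object, and the loader block runs only once for the "person" model.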
Now that we are done with initializations, we can finally begin to do some
text processing.
sentences = @detector.sentDetect(text)
sentence_positions = [0] + @detector.sentPosDetect(text)
Here, I split the text into sentences. It is generally recommended to use the
entity identification on sentences rather than full documents, as it avoids
mistakenly detecting names across sentence boundaries and also improves the
accuracy. In addition to finding the sentences as a list of strings, I need a
list of the positions where the sentences start so that I can insert the tags
into the text later. I prepend position 0 to the list because sentPosDetect
only returns the positions where a sentence ends, but I also need the
beginning of the first sentence.
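To illustrate why the prepended 0 matters, here is a plain Ruby sketch that does not call OpenNLP at all. The end_positions array is a hand-built stand-in for what sentPosDetect is described as returning above; the exact values the real detector produces may differ.

```ruby
text = "Mrs. Schmidt drives to Copenhagen on Monday. She returns on Friday."

# Assumed stand-in for @detector.sentPosDetect(text): the offsets where each
# sentence ends, which is also where the next sentence begins.
end_positions = [text.index("She"), text.length]

# Prepending 0 gives us the start offset of every sentence.
sentence_positions = [0] + end_positions

# Pair each start with the following boundary to recover the sentences.
sentences = sentence_positions.each_cons(2).map { |from, to| text[from...to] }

# An offset inside a sentence becomes an offset in the full text by adding
# the start position of that sentence.
local_offset  = sentences[1].index("Friday")
global_offset = sentence_positions[1] + local_offset
found = text[global_offset, "Friday".length]
```

Without the leading 0, the index of a sentence would no longer line up with the index of its start offset.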
annotations = []
sentences.each_with_index do |sentence, sentence_idx|
  token_positions = @tokenizer.tokenizePos(sentence)
  tokens = Span.spansToStrings(token_positions, sentence)
  @finders.each do |model, finder|
    names = finder.find(tokens)
    annotations += names.map do |name|
      start_pos = token_positions[name.getStart].getStart + sentence_positions[sentence_idx]
      end_pos = token_positions[name.getEnd - 1].getEnd + sentence_positions[sentence_idx]
      Struct::Annotation.new(model, Range.new(start_pos, end_pos))
    end
  end
end
I iterate over all sentences, breaking the sentence into tokens. Again, I need
the tokens as strings and also their positions for inserting the tags. The
search for named entities is done by applying each finder, which provides me
with a list of names. These names are Span objects which give the indexes of
the tokens where a detected entity starts and ends. The results are gathered
in the annotations list, where I put Annotation structures. Each annotation
contains the name of the model that was matched and a range holding the
positions where the entity starts and ends in the original text, calculated
from the sentence and token offsets.
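The snippet above uses Struct::Annotation without showing its definition. The sketch below assumes it is a plain Struct with the members model and range, and replays the offset arithmetic without OpenNLP. SpanStub and the regex tokenizer are hypothetical stand-ins for OpenNLP's Span and SimpleTokenizer.

```ruby
# Assumed definition: passing a string name registers Struct::Annotation.
Struct.new("Annotation", :model, :range)

# Hypothetical stand-in for OpenNLP's Span: start inclusive, end exclusive.
SpanStub = Struct.new(:from, :to)

prefix   = "It is raining today. "
sentence = "Mrs. Schmidt drives to Copenhagen on Monday."
text     = prefix + sentence
sentence_start = prefix.length

# Rough imitation of tokenizePos: character offsets of every token,
# relative to the start of the sentence.
token_positions = []
sentence.scan(/\w+|[^\w\s]/) do
  match = Regexp.last_match
  token_positions << SpanStub.new(match.begin(0), match.end(0))
end

# Pretend the location finder matched the single token at index 5.
name = SpanStub.new(5, 6)

# Token offsets are relative to the sentence, so add the sentence start.
start_pos = token_positions[name.from].from + sentence_start
end_pos   = token_positions[name.to - 1].to + sentence_start
annotation = Struct::Annotation.new("location", Range.new(start_pos, end_pos))

# range.end is the offset just past the entity, so slice with an exclusive end.
entity = text[annotation.range.begin...annotation.range.end]
```

Here entity ends up as "Copenhagen", even though the token offsets were measured inside the second sentence only.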
When all the annotations have been gathered, we can apply the tags to the original text.
@formatter.apply_annotations(text, annotations)
The formatter applies the annotations by putting <span> tags around the
named entities.
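The formatter itself is not shown here, so the following is only a sketch of how such a step could work, not the implementation from the library. SpanFormatter is a hypothetical class; it assumes annotations do not overlap and that range.end is the offset just past the entity, as in the loop above.

```ruby
# Assumed shape of an annotation, mirroring Struct::Annotation used above.
Struct.new("Annotation", :model, :range) unless Struct.const_defined?(:Annotation, false)

# Hypothetical formatter; assumes non-overlapping annotations.
class SpanFormatter
  def apply_annotations(text, annotations)
    result = text.dup
    # Insert from the end of the text backwards so that the offsets of the
    # annotations still to be processed remain valid.
    annotations.sort_by { |a| a.range.begin }.reverse_each do |a|
      result.insert(a.range.end, "</span>")
      result.insert(a.range.begin, %(<span class="#{a.model}">))
    end
    result
  end
end

text = "Mrs. Schmidt drives to Copenhagen on Monday."
annotations = [
  Struct::Annotation.new("person",   Range.new(5, 12)),
  Struct::Annotation.new("location", Range.new(23, 33)),
  Struct::Annotation.new("date",     Range.new(37, 43))
]
tagged = SpanFormatter.new.apply_annotations(text, annotations)
```

For the example sentence, tagged matches the output given in the introduction. A real formatter would likely also need to escape HTML in the text and decide what to do when two models claim overlapping ranges.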
I put this code into a small library that you can find on GitHub.