Identifying Names with OpenNLP and JRuby

When trying to make sense of text, nouns are a good place to start, as they usually convey the subject of the text. To be more precise, when analyzing text to find out what it’s about, we want to identify, among other things, people, organizations, and locations, which are also known as named entities.

In this post, I take a look at how to find named entities using OpenNLP and JRuby. OpenNLP is a collection of tools for natural language processing written in Java. In addition to named entity identification, OpenNLP includes tools for sentence detection, part-of-speech tagging, and more. JRuby makes it possible to use Java libraries from Ruby code.

When the names have been identified, I want to tag them using an ad-hoc HTML microformat in which each name is enclosed in a <span> tag whose CSS class corresponds to the type of named entity. For example, the text “Mrs. Schmidt drives to Copenhagen on Monday.” should be processed into “Mrs. <span class="person">Schmidt</span> drives to <span class="location">Copenhagen</span> on <span class="date">Monday</span>.”

Dependencies

In order to follow along with the examples here, you need a working JRuby installation and the following libraries and data files.

Libraries

  • OpenNLP Tools: the basic tools, which include the named entity detection;
  • OpenNLP Maxent: a machine learning package required for loading the language models;
  • GNU Trove: a library of Java collections needed by OpenNLP.

Model Data

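OpenNLP uses pre-trained models to identify sentences and named entities. The code in this post loads the English sentence detection model (EnglishSD.bin.gz) and one name finder model each for persons, locations, dates, and organizations (person.bin.gz, location.bin.gz, date.bin.gz, organization.bin.gz) from a models directory.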

Getting Started

First, we need to load all the dependencies into our application or interactive Ruby session.

require 'java'
require File.join(File.dirname(__FILE__), '../deps/opennlp-tools-1.4.3.jar')
require File.join(File.dirname(__FILE__), '../deps/maxent-2.5.2.jar')
require File.join(File.dirname(__FILE__), '../deps/trove-2.1.0.jar')

java_import Java::opennlp.tools.namefind.NameFinderME
java_import Java::opennlp.tools.tokenize.SimpleTokenizer
java_import Java::opennlp.maxent.io.BinaryGISModelReader
java_import Java::opennlp.tools.lang.english.SentenceDetector
java_import Java::opennlp.tools.util.Span

By requiring ‘java’, we tell JRuby to make Java interoperability available. In addition to its primary function of loading Ruby code, require also lets us load jar files. When we want to use Java classes directly in our current namespace, we can use the java_import method.

Next, I initialize the objects that will do the work.

@tokenizer = SimpleTokenizer.new
@detector = SentenceDetector.new(File.expand_path("models/EnglishSD.bin.gz"))
@finders = %w{person location date organization}.map do |model|
  [model, NameFinderME.new(BinaryGISModelReader.new(java.io.File.new("models/#{model}.bin.gz")).getModel)]
end

The tokenizer is needed to split the text into words, which are passed to the entity identifier. The sentence detector needs to load a model for identifying sentences in a given language. This is necessary because it needs to be aware of things like abbreviations that are followed by periods; otherwise it would treat “Mrs.” in the example above as a sentence of its own. The finders are responsible for the actual identification. We need one of these for each model we want to apply. Here, I want to identify persons, locations, dates, and organizations. Loading the models is relatively expensive, so in a real application you will probably want to do it only once and hold on to the created objects.
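One way to hold on to the loaded objects is a small memoizing cache. The following is only a sketch in plain Ruby that makes no OpenNLP calls: ModelCache and the loader block are hypothetical names, and the block stands in for the expensive NameFinderME construction shown above.

```ruby
# Hypothetical helper (not part of OpenNLP): builds each finder at most once.
class ModelCache
  def initialize(&loader)
    @loader = loader   # block that performs the expensive load for a model name
    @cache = {}
  end

  # Returns the cached object for +name+, loading it on first use only.
  def fetch(name)
    @cache.fetch(name) { @cache[name] = @loader.call(name) }
  end
end

# Stand-in loader; a real one would build a NameFinderME from "models/#{name}.bin.gz".
load_count = 0
finders = ModelCache.new do |name|
  load_count += 1
  "finder for #{name}"
end

finders.fetch("person")
finders.fetch("person")   # served from the cache, the loader is not called again
```

In a long-running application, an object like this could be created once at startup and shared by every request that needs a finder.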

Now that we are done with initializations, we can finally begin to do some text processing.

sentences = @detector.sentDetect(text)
sentence_positions = [0] + @detector.sentPosDetect(text)

Here, I split the text into sentences. It is generally recommended to run the entity identification on sentences rather than on full documents, as this avoids mistakenly detecting names across sentence boundaries and also improves accuracy. In addition to the sentences as a list of strings, I need a list of the positions where the sentences start, so that I can insert the tags into the text later. I prepend position 0 to the list because sentPosDetect only returns the positions where a sentence ends, but I also need the beginning of the first sentence.
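To make the role of the prepended 0 concrete, here is a small sketch in plain Ruby. The boundary values are made up for this example text and are not real detector output; they only follow the description above, where each position marks the point at which a sentence ends.

```ruby
text = "It rained. Mrs. Schmidt drives to Copenhagen on Monday."

# Assumed boundary positions for this text (hypothetical, not detector output).
boundaries = [11, 55]
sentence_positions = [0] + boundaries

# Pair consecutive positions to recover each sentence with its start offset.
sentences_with_offsets = sentence_positions.each_cons(2).map do |from, to|
  [from, text[from...to].strip]
end
```

Without the leading 0, the first pair would begin at position 11 and the first sentence would be lost.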

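# Struct::Annotation is assumed to be defined elsewhere, for example with
# Struct.new("Annotation", :type, :range); the field names are an assumption.
annotations = []
sentences.each_with_index do |sentence, sentence_idx|
  # Character spans of the tokens within this sentence, and the tokens as strings.
  token_positions = @tokenizer.tokenizePos(sentence)
  tokens = Span.spansToStrings(token_positions, sentence)
  @finders.each do |model, finder|
    # Each name is a Span of token indexes (start inclusive, end exclusive).
    names = finder.find(tokens)
    annotations += names.map do |name|
      # Convert token indexes to character offsets in the full text.
      start_pos = token_positions[name.getStart].getStart + sentence_positions[sentence_idx]
      end_pos = token_positions[name.getEnd-1].getEnd + sentence_positions[sentence_idx]
      Struct::Annotation.new(model, Range.new(start_pos, end_pos))
    end
  end
end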

I iterate over all sentences, breaking each sentence into tokens. Again, I need the tokens as strings and also their positions for inserting the tags. The search for named entities is done by applying each finder in turn, which gives me a list of names. These names are Span objects that hold the indexes of the tokens where a detected entity starts and ends. The results are gathered in the annotations list as Annotation structures. Each annotation contains the name of the model that matched and the range of positions the entity occupies in the original text, calculated from the sentence and token offsets.
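The offset arithmetic can be checked with a worked example in plain Ruby. This is a sketch under assumptions: the regular expression is only a stand-in for the tokenizer, and the token indexes of the entity are supplied by hand rather than by a finder.

```ruby
text = "It rained. Mrs. Schmidt drives to Copenhagen on Monday."
sentence_start = 11                 # offset of the second sentence within text
sentence = text[sentence_start..-1]

# Stand-in for tokenizePos: [start, end) character offsets of each token,
# relative to the beginning of the sentence.
token_positions = []
sentence.scan(/\w+|[^\w\s]/) { token_positions << [$~.begin(0), $~.end(0)] }

# Assume a finder reported the entity as token indexes 5...6 (end exclusive).
name_start = 5
name_end = 6

# Token offset within the sentence plus sentence offset within the text.
start_pos = token_positions[name_start][0] + sentence_start
end_pos = token_positions[name_end - 1][1] + sentence_start

entity = text[start_pos...end_pos]
puts entity
```

Here the token "Copenhagen" sits at characters 23 to 33 of its sentence, so adding the sentence offset of 11 gives 34 to 44 in the full text.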

When all the annotations have been gathered, we can apply the tags to the original text.

@formatter.apply_annotations(text, annotations)

The formatter applies the annotations by putting <span> tags around the named entities.
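A formatter along these lines could look roughly like the following sketch. It is not the code from the library: the Annotation struct and its field names are hypothetical, and the ranges are assumed to be end-exclusive character offsets that do not overlap.

```ruby
# Hypothetical stand-in for the Annotation struct used above.
Annotation = Struct.new(:type, :range)

# Wraps each annotated range of text in a <span> whose class is the entity type.
def apply_annotations(text, annotations)
  result = text.dup
  # Work from the end of the text backwards so earlier offsets stay valid.
  annotations.sort_by { |a| -a.range.begin }.each do |a|
    result[a.range] = %(<span class="#{a.type}">#{result[a.range]}</span>)
  end
  result
end

text = "Mrs. Schmidt drives to Copenhagen on Monday."
annotations = [
  Annotation.new("person", 5...12),
  Annotation.new("location", 23...33),
  Annotation.new("date", 37...43)
]
tagged = apply_annotations(text, annotations)
puts tagged
```

Inserting the tags from the last annotation to the first means that each insertion only shifts text that has already been processed, so the remaining offsets can be used unchanged.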

I put this code into a small library that you can find on GitHub.
