How to use Stanford NER with Spanish text

Last week I was looking for a Java library to run NER (Named Entity Recognition) on Spanish text. I have used FreeLing in the past and I have to say that it's quite good, but this time I wanted to avoid calling C code from Java. My first idea was to try OpenNLP. I ran some tests with its Spanish models, but I was quite disappointed with the accuracy, so I decided to move on and look for another library. The next one was the Stanford Named Entity Recognizer, a pure Java 8 library from the Stanford Natural Language Processing Group. After some tests, I realized that this would be my choice: in my tests the precision was really good, both in English and in Spanish.

These are the steps to follow in order to integrate it as a Java library:

  1. Download the latest version of the Stanford Named Entity Recognizer (currently 3.6.0).
  2. Unzip it. You will find some shell scripts to play with, the jar (stanford-ner.jar), the javadoc and the sources. There is also a folder named ‘classifiers’ with the English models and a folder named ‘lib’ with the extra dependencies it needs.
  3. Download the Spanish models. This is a jar file: unzip it and go to the /edu/stanford/nlp/models/ner folder. Then copy the “spanish.ancora.distsim.s512.crf.ser.gz” file to a folder that is on your application's classpath, for example src/main/resources.
  4. Add stanford-ner.jar to the classpath of your application. I decided to “mavenize” this jar as a local Maven artifact, using the standard Maven command:
mvn install:install-file -Dfile=stanford-ner.jar -DgroupId=com.tesnick.ai.nlp.ner -DartifactId=stanford-jar -Dversion=3.6.0 -Dpackaging=jar

After this, I added the dependency to my pom.xml:

<dependency>
    <groupId>com.tesnick.ai.nlp.ner</groupId>
    <artifactId>stanford-jar</artifactId>
    <version>3.6.0</version>
</dependency>

(You may need to add some extra dependencies, like slf4j or joda-time. In my case I only had to add slf4j-api.)
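Before loading the classifier, it can help to confirm that the model file copied in step 3 is really visible on the classpath, since a missing model is an easy mistake to make. The following is only a minimal sketch using the standard Java class loader; the class name ModelLocator and the method isOnClasspath are hypothetical helpers made up for this illustration and are not part of Stanford NER.

```java
import java.net.URL;

public class ModelLocator {

    /** Returns true if the named resource is visible to this class's class loader. */
    static boolean isOnClasspath(String resourceName) {
        URL url = ModelLocator.class.getClassLoader().getResource(resourceName);
        return url != null;
    }

    public static void main(String[] args) {
        // File name taken from step 3; adjust it if you renamed or moved the model.
        String model = "spanish.ancora.distsim.s512.crf.ser.gz";
        if (isOnClasspath(model)) {
            System.out.println("Found model: " + model);
        } else {
            System.out.println("Model not on classpath: " + model);
        }
    }
}
```

If the second message is printed, check that the file sits directly under src/main/resources (or whichever classpath folder you chose) and that the project has been rebuilt.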

Finally, to use it you can try something like:

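import java.util.List;

import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;

// Model file names; the Spanish one is the file copied in step 3
String spanishSerializedClassifier = "spanish.ancora.distsim.s512.crf.ser.gz";
// English model, shown for comparison (not used below)
String englishSerializedClassifier = "english.all.3class.distsim.crf.ser.gz";

// Load the Spanish CRF model (getClassifier declares IOException and ClassNotFoundException)
AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(spanishSerializedClassifier);

// classify() returns one List<CoreLabel> per sentence
List<List<CoreLabel>> apply = classifier.classify("David Bowie toma las calles del mundo.");

for (List<CoreLabel> coreLabels : apply) {
    // Print the tokens of the sentence
    System.out.println(coreLabels);

    // Print each token together with its predicted entity tag
    for (CoreLabel word : coreLabels) {
        System.out.print(word.word() + '/' + word.get(CoreAnnotations.AnswerAnnotation.class) + ' ');
    }
}

The result will be: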

[David, Bowie, toma, las, calles, del, mundo, .]
David/PERSON Bowie/PERSON toma/O las/O calles/O del/O mundo/O ./O  
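The output above is tagged token by token, so a multi-word name like "David Bowie" shows up as two separate PERSON tokens. As a rough illustration of how those could be merged into whole entities, here is a small sketch in plain Java that parses a word/TAG line like the one printed above. The class EntityGrouper and its group method are hypothetical names invented for this example, not part of the Stanford library, and the sketch assumes that "O" marks tokens outside any entity, as in the output shown.

```java
import java.util.ArrayList;
import java.util.List;

public class EntityGrouper {

    /** Merges consecutive tokens that share the same non-"O" tag into "TAG: text" entries. */
    static List<String> group(String taggedLine) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";

        for (String pair : taggedLine.trim().split("\\s+")) {
            // Split on the last slash so tokens that contain '/' keep their text intact
            int slash = pair.lastIndexOf('/');
            if (slash < 0) {
                continue; // not a word/TAG pair, skip it
            }
            String word = pair.substring(0, slash);
            String tag = pair.substring(slash + 1);

            if (tag.equals(currentTag) && !tag.equals("O")) {
                // Same entity continues: append the word
                current.append(' ').append(word);
            } else {
                // Tag changed: close the previous entity, if there was one
                if (!currentTag.equals("O")) {
                    entities.add(currentTag + ": " + current);
                }
                current.setLength(0);
                current.append(word);
                currentTag = tag;
            }
        }
        // Close an entity that runs to the end of the line
        if (!currentTag.equals("O")) {
            entities.add(currentTag + ": " + current);
        }
        return entities;
    }

    public static void main(String[] args) {
        System.out.println(group("David/PERSON Bowie/PERSON toma/O las/O calles/O del/O mundo/O ./O"));
        // prints [PERSON: David Bowie]
    }
}
```

In a real application you would probably work on the CoreLabel objects directly instead of re-parsing printed text; the string form is used here only to keep the sketch free of external dependencies.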

You can download a sample project with this code from my GitHub page.
