/**
 * WEBLAB: Service oriented integration platform for media mining and intelligence applications
 *
 * Copyright (C) 2004 - 2009 EADS DEFENCE AND SECURITY SYSTEMS
 *
 * This library is free software; you can redistribute it and/or modify it under the terms of
 * the GNU Lesser General Public License as published by the Free Software Foundation; either
 * version 2.1 of the License, or (at your option) any later version.
 *
 * This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
 * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 * See the GNU Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License along with this
 * library; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor,
 * Boston, MA 02110-1301 USA
 */
 
 
package org.ow2.weblab.service.language;

import java.io.File;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import javax.jws.WebService;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.weblab_project.core.factory.AnnotationFactory;
import org.weblab_project.core.helper.PoKHelper;
import org.weblab_project.core.helper.RDFHelperFactory;
import org.weblab_project.core.model.Annotation;
import org.weblab_project.core.model.ComposedUnit;
import org.weblab_project.core.model.MediaUnit;
import org.weblab_project.core.model.Resource;
import org.weblab_project.core.model.text.Text;
import org.weblab_project.core.ontologies.DublinCore;
import org.weblab_project.core.ontologies.WebLab;
import org.weblab_project.core.properties.PropertiesLoader;
import org.weblab_project.core.util.ResourceUtil;
import org.weblab_project.services.analyser.Analyser;
import org.weblab_project.services.analyser.ProcessException;
import org.weblab_project.services.analyser.types.ProcessArgs;
import org.weblab_project.services.analyser.types.ProcessReturn;
import org.weblab_project.services.exception.WebLabException;


/**
 * This class is a WebLab Web service for identifying the language of a Text.<br />
 * It is a wrapper around the NGramJ project: "http://ngramj.sourceforge.net/". It uses the CNGram system, which can process character strings instead of raw text files.<br />
 * For each input text, this algorithm returns a score for every language profile previously learned (.ngp files). The score is a double between 0 and 1: 1 means that the text is certainly written in this language; 0, on the contrary, means that the text is not written in this language. The scores sum to 1.<br />
 * Our wrapper annotates every Text section of a ComposedUnit given as input (or the Text itself if the input is a Text). It fails if the input is anything else. On each Text it uses CNGram to determine which language profiles are the best candidates to be annotated (using the DC:language property). It can be configured using a property file named ngram.properties. This file handles 6 properties:
 * <ul>
 * <li>minSingleValue: a double value between 0 and 1. If the best language score is greater than this value, that language is the only one annotated on a given Text.</li>
 * <li>minMultipleValue: a double value between 0 and 1. When no language exceeds minSingleValue, every language whose score is at least this value is annotated on a given Text, up to maxNbValues languages. If even the best score is below this value, the Text is annotated as UNKNOWN.</li>
 * <li>maxNbValues: a positive integer value. The number of languages annotated on a given Text cannot exceed this value.</li>
 * <li>profilesFolderPath: a String that represents a folder path. This folder contains .ngp files that are loaded instead of the 28 default CNGram languages.</li>
 * <li>addTopLevelAnnot: a boolean value. It defines whether or not to annotate the whole document with the language extracted from the concatenation of every Text content.</li>
 * <li>isProducedByObject: a String value that should be a valid URI. It defines the URI to be used as the object of every isProducedBy statement on annotations created by the service.</li>
 * </ul>
 * These 6 properties are optional. Default values are:
 * <ul>
 * <li>minSingleValue: '0.75'</li>
 * <li>minMultipleValue: '0.15'</li>
 * <li>maxNbValues: '1'</li>
 * <li>profilesFolderPath: in this case, the default constructor for CNGram profiles is used, which loads the default profiles shipped in their jar file. These 28 profiles are named using ISO 639-1 two-letter language codes, so the resulting DC:language annotations are in this format. If you want to use another format, you have to use a custom profiles folder (containing .ngp files).</li>
 * <li>addTopLevelAnnot: false</li>
 * <li>isProducedByObject: in this case, no isProducedBy annotation is created.</li>
 * </ul>
 *
 * @author EADS IPCC Team
 * @date 2009-11-05
 */
 
 @WebService(endpointInterface = "org.weblab_project.services.analyser.Analyser")
 public class LanguageExtraction implements Analyser {
 
 	private final static String PROPERTY_FILE = "ngram.properties";
 
 	private final static Log LOG = LogFactory.getLog(LanguageExtraction.class);
 
 	private final static double DEFAULT_MIN_SINGLE_VALUE = 0.75;
 	private final static double DEFAULT_MIN_MULTIPLE_VALUE = 0.15;
 	private final static int DEFAULT_MAX_NB_VALUES = 1;
 
 	private final static String MIN_SINGLE_VALUE = "minSingleValue";
 	private final static String MIN_MULTIPLE_VALUE = "minMultipleValue";
	private final static String MAX_NB_VALUES = "maxNbValues";
	private final static String PROFILES_FOLDER_PATH = "profilesFolderPath";
	private final static String ADD_TOP_LEVEL_ANNOT = "addTopLevelAnnot";
	private final static String IS_PRODUCED_BY_OBJECT = "isProducedByObject";
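	private static final String UNKNOWN = "UNKNOWN";

	private double minSingleValue;
	private double minMultipleValue;
	private int maxNbValues;
	private boolean addTopLevelAnnot;
	// The two fields below are used by the methods of this class, but their declarations
	// were lost in the extracted source; their names are reconstructed.
	private NGramProfilesPatched profiles;
	private String isProducedByObject;
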
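	/**
	 * Reads the property file to get the field values.
	 */
	public void init() throws LanguageExtractionException {
		Map<String, String> props = PropertiesLoader.loadProperties(PROPERTY_FILE);

		final String minSingleValueP = props.get(MIN_SINGLE_VALUE);
		if (minSingleValueP != null && !minSingleValueP.isEmpty()) {
			try {
				this.minSingleValue = Double.parseDouble(minSingleValueP);
			} catch (final NumberFormatException nfe) {
				LOG.warn("Unable to parse double for " + MIN_SINGLE_VALUE + " property. Value was: '" + minSingleValueP + "'.");
				this.minSingleValue = DEFAULT_MIN_SINGLE_VALUE;
			}
		} else {
			this.minSingleValue = DEFAULT_MIN_SINGLE_VALUE;
		}

		final String minMultipleValueP = props.get(MIN_MULTIPLE_VALUE);
		if (minMultipleValueP != null && !minMultipleValueP.isEmpty()) {
			try {
				this.minMultipleValue = Double.parseDouble(minMultipleValueP);
			} catch (final NumberFormatException nfe) {
				LOG.warn("Unable to parse double for " + MIN_MULTIPLE_VALUE + " property. Value was: '" + minMultipleValueP + "'.");
				this.minMultipleValue = DEFAULT_MIN_MULTIPLE_VALUE;
			}
		} else {
			this.minMultipleValue = DEFAULT_MIN_MULTIPLE_VALUE;
		}

		// The single-language threshold must not be lower than the multiple-language one.
		if (this.minSingleValue < this.minMultipleValue) {
			LOG.warn(MIN_SINGLE_VALUE + " was smaller than " + MIN_MULTIPLE_VALUE + ". Using the two default values instead.");
			this.minSingleValue = DEFAULT_MIN_SINGLE_VALUE;
			this.minMultipleValue = DEFAULT_MIN_MULTIPLE_VALUE;
		}
		LOG.debug("LanguageExtraction initialised with " + MIN_SINGLE_VALUE + "=" + this.minSingleValue);
		LOG.debug("LanguageExtraction initialised with " + MIN_MULTIPLE_VALUE + "=" + this.minMultipleValue);
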
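		final String maxNbValuesP = props.get(MAX_NB_VALUES);
		if (maxNbValuesP != null && !maxNbValuesP.isEmpty()) {
			try {
				this.maxNbValues = Integer.parseInt(maxNbValuesP);
			} catch (final NumberFormatException nfe) {
				LOG.warn("Unable to parse integer for " + MAX_NB_VALUES + " property. Value was: '" + maxNbValuesP + "'.");
				this.maxNbValues = DEFAULT_MAX_NB_VALUES;
			}
		} else {
			this.maxNbValues = DEFAULT_MAX_NB_VALUES;
		}
		if (this.maxNbValues < 1) {
			LOG.warn(MAX_NB_VALUES + " was smaller than 1. Using the default value instead.");
			this.maxNbValues = DEFAULT_MAX_NB_VALUES;
		}
		LOG.debug("LanguageExtraction initialised with " + MAX_NB_VALUES + "=" + this.maxNbValues);
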
		// Load custom profiles if a usable folder is configured; otherwise fall back to the defaults.
		final String profilesFolderPathP = props.get(PROFILES_FOLDER_PATH);
		if (profilesFolderPathP != null && !profilesFolderPathP.isEmpty()) {
			File file = new File(profilesFolderPathP);
			if (!file.exists()) {
				LOG.warn("File '" + file.getAbsolutePath() + "' does not exist. Creating LanguageExtraction with default configuration.");
				try {
					this.profiles = new NGramProfilesPatched();
				} catch (final IOException ioe) {
					throw new LanguageExtractionException("Unable to create NGramProfilesPatched using default value.", ioe);
				}
			} else if (!file.canRead()) {
				LOG.warn("File '" + file.getAbsolutePath() + "' is not readable. Creating LanguageExtraction with default configuration.");
				try {
					this.profiles = new NGramProfilesPatched();
				} catch (final IOException ioe) {
					throw new LanguageExtractionException("Unable to create NGramProfilesPatched using default value.", ioe);
				}
			} else if (!file.isDirectory()) {
				LOG.warn("File '" + file.getAbsolutePath() + "' is not a directory. Creating LanguageExtraction with default configuration.");
				try {
					this.profiles = new NGramProfilesPatched();
				} catch (final IOException ioe) {
					throw new LanguageExtractionException("Unable to create NGramProfilesPatched using default value.", ioe);
				}
			} else {
				try {
					this.profiles = new NGramProfilesPatched(file);
				} catch (final IOException ioe) {
					LOG.warn("Unable to create NGramProfilesPatched using value of " + PROFILES_FOLDER_PATH + " property. Value was: '" + file.getAbsolutePath()
							+ "'. Trying to create the default one.", ioe);
					try {
						this.profiles = new NGramProfilesPatched();
					} catch (final IOException ioe2) {
						throw new LanguageExtractionException("Unable to create NGramProfilesPatched using default value.", ioe2);
					}
				}
			}
		} else {
			try {
				this.profiles = new NGramProfilesPatched();
			} catch (final IOException ioe) {
				throw new LanguageExtractionException("Unable to create NGramProfilesPatched using default value.", ioe);
			}
		}

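		if (LOG.isDebugEnabled()) {
			final StringBuilder sb = new StringBuilder();
			sb.append("LanguageExtraction initialised with the following " + this.profiles.getProfileCount() + " language profiles: [");
			for (int p = 0; p < this.profiles.getProfileCount(); p++) {
				// The appended value was lost in the extracted source; getProfileName(p) is assumed here.
				sb.append(this.profiles.getProfileName(p));
				if (p < this.profiles.getProfileCount() - 1) {
					sb.append(", ");
				} else {
					sb.append("]");
				}
			}
			LOG.debug(sb.toString());
		}
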
		final String addTopLevelAnnotP = props.get(ADD_TOP_LEVEL_ANNOT);
		if (addTopLevelAnnotP != null && !addTopLevelAnnotP.isEmpty()) {
			this.addTopLevelAnnot = Boolean.parseBoolean(addTopLevelAnnotP);
		}

		// May be null
		this.isProducedByObject = props.get(IS_PRODUCED_BY_OBJECT);
	}

	/*
	 * (non-Javadoc)
	 * 
	 * @see org.weblab_project.services.analyser.Analyser#process(org.weblab_project.services.analyser.types.ProcessArgs)
	 */
	public ProcessReturn process(ProcessArgs processArgs) throws ProcessException {
		List<Text> texts = this.checkArgs(processArgs);
		final boolean topLevelAnnot = this.addTopLevelAnnot && (processArgs.getResource() instanceof ComposedUnit);
		// Accumulates the content of every Text for the optional document-level annotation.
		final StringBuilder sb = new StringBuilder();
		for (Text text : texts) {
			if (text.getContent() == null || text.getContent().isEmpty()) {
				LOG.debug("Text '" + text.getUri() + "' has no content; ignored.");
				continue;
			}
			List<String> profileToAnnotate = this.checkLanguage(text.getContent(), text.getUri());
			this.annotate(text, profileToAnnotate);
			if (topLevelAnnot) {
				sb.append(text.getContent());
				sb.append("\n\n\n");
			}
		}
		if (topLevelAnnot && sb.length() > 0) {
			ComposedUnit cu = (ComposedUnit) processArgs.getResource();
			List<String> profileToAnnotate = this.checkLanguage(sb.toString(), cu.getUri());
			this.annotate(cu, profileToAnnotate);
		}
		ProcessReturn pr = new ProcessReturn();
		pr.setResource(processArgs.getResource());
		return pr;
	}

	/**
	 * @param res
	 *            The resource to be annotated
	 * @param profileToAnnotate
	 *            The languages to annotate using dc:language property statements on res.
	 */
	private void annotate(Resource res, final List<String> profileToAnnotate) {
		Annotation annot = AnnotationFactory.createAndLinkAnnotation(res);
		PoKHelper pokH = RDFHelperFactory.getPoKHelper(annot);
		pokH.setAutoCommitMode(false);
		for (final String language : profileToAnnotate) {
			pokH.createLitStat(res.getUri(), DublinCore.LANGUAGE_PROPERTY_NAME, language);
		}
		pokH.commit();
		// Only state the producer when isProducedByObject was configured.
		if (this.isProducedByObject != null) {
			Annotation annot2 = AnnotationFactory.createAndLinkAnnotation(annot);
			PoKHelper pokH2 = RDFHelperFactory.getPoKHelper(annot2);
			pokH2.createResStat(annot.getUri(), WebLab.IS_PRODUCED_BY, this.isProducedByObject);
		}
	}

	/**
	 * @param content
	 *            The text whose language is to be identified
	 * @param uri
	 *            The URI, used for logging purposes.
	 * @return An ordered list of the languages identified according to the parameters (minSingleValue, maxNbValues and minMultipleValue).
	 */
	private List<String> checkLanguage(final String content, final String uri) {
		List<String> profileToAnnotate = new LinkedList<String>();
		Ranker ranker = this.profiles.getRanker();
		ranker.account(content);
		RankResult result = ranker.getRankResult();
		boolean warn = false;
		// Profiles are listed in their rank order
		final double bestScore = result.getScore(0);
		if (bestScore > this.minSingleValue) {
			// One clear winner: annotate it alone.
			profileToAnnotate.add(result.getName(0));
		} else if (bestScore < this.minMultipleValue) {
			// Even the best candidate is too weak.
			profileToAnnotate.add(UNKNOWN);
			warn = true;
		} else {
			// Keep the top candidates that reach minMultipleValue, up to maxNbValues.
			final int max = Math.min(result.getLength(), this.maxNbValues);
			for (int p = 0; p < max; p++) {
				if (result.getScore(p) >= this.minMultipleValue) {
					profileToAnnotate.add(result.getName(p));
				} else {
					break;
				}
			}
		}
		if (LOG.isDebugEnabled() || warn) {
			final StringBuilder sb = new StringBuilder();
			sb.append("Languages detected for MediaUnit '" + uri + "' are: [");
			for (int p = 0; p < result.getLength(); p++) {
				sb.append(result.getName(p));
				sb.append(" - ");
				sb.append(result.getScore(p));
				if (p < result.getLength() - 1) {
					sb.append(" --|-- ");
				} else {
					sb.append("]");
				}
			}
			if (warn) {
				LOG.warn(sb.toString());
				LOG.warn("Unable to identify language for MediaUnit '" + uri + "'; " + profileToAnnotate + " will be annotated.");
			} else {
				LOG.debug(sb.toString());
				LOG.debug("Languages to be annotated for MediaUnit '" + uri + "' are: " + profileToAnnotate);
			}
		}
		return profileToAnnotate;
	}

	/**
	 * @param processArg
	 *            The processArgs; i.e. a usageContext (not used) and a Resource that must be either a ComposedUnit or a Text.
	 * @return A list of the Texts contained in the resource of processArgs
	 * @throws ProcessException
	 *             If processArgs is null; or if resource is null; or if resource is neither a ComposedUnit nor a Text.
	 */
	private List<Text> checkArgs(final ProcessArgs processArg) throws ProcessException {
		if (processArg == null) {
			throw createE1ProcessException("ProcessArgs was null.");
		}
		Resource res = processArg.getResource();
		if (res == null) {
			throw createE1ProcessException("Resource in ProcessArgs was null.");
		}
		if (!(res instanceof MediaUnit)) {
			throw createE1ProcessException("Resource in ProcessArgs was not an instance of MediaUnit but of '" + res.getClass().getCanonicalName() + "'.");
		}
		final List<Text> texts;
		if (res instanceof ComposedUnit) {
			texts = ResourceUtil.getSelectedSubResources(res, Text.class);
		} else if (res instanceof Text) {
			texts = new LinkedList<Text>();
			texts.add((Text) res);
		} else {
			throw createE1ProcessException("Resource in ProcessArgs was neither an instance of ComposedUnit nor of Text but of '" + res.getClass().getCanonicalName() + "'.");
		}
		return texts;
	}

	/**
	 * @param message
	 *            The message to be used in the ProcessException created
	 * @return A ProcessException containing message as its message and an E1 WebLabException
	 */
	private static ProcessException createE1ProcessException(final String message) {
		WebLabException wle = new WebLabException();
		wle.setErrorId("E1");
		wle.setErrorMessage("Invalid parameter");
		final ProcessException pe = new ProcessException(message, wle);
		LOG.error(message, pe);
		return pe;
	}
}
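
The way minSingleValue, minMultipleValue and maxNbValues interact is easier to see on concrete scores. The following is a minimal standalone sketch of the selection rule described in the class documentation; it does not use NGramJ or WebLab. The class name LanguageSelectionSketch, the select method and the sample scores are hypothetical and only serve as an illustration; the scores are assumed to be already sorted in decreasing order, as a ranker would return them.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: mirrors the threshold rule, not the service itself. */
public class LanguageSelectionSketch {

	static final String UNKNOWN = "UNKNOWN";

	/**
	 * Picks the languages to annotate from ranked candidates.
	 * names[i] and scores[i] describe the i-th best candidate (hypothetical inputs).
	 */
	static List<String> select(String[] names, double[] scores, double minSingleValue, double minMultipleValue, int maxNbValues) {
		List<String> selected = new ArrayList<String>();
		if (scores.length == 0) {
			// Assumption of this sketch: no candidate at all is treated as unknown.
			selected.add(UNKNOWN);
			return selected;
		}
		final double bestScore = scores[0];
		if (bestScore > minSingleValue) {
			// One clear winner: keep it alone.
			selected.add(names[0]);
		} else if (bestScore < minMultipleValue) {
			// Even the best candidate is too weak.
			selected.add(UNKNOWN);
		} else {
			// Keep the top candidates that reach minMultipleValue, up to maxNbValues.
			final int max = Math.min(scores.length, maxNbValues);
			for (int p = 0; p < max && scores[p] >= minMultipleValue; p++) {
				selected.add(names[p]);
			}
		}
		return selected;
	}

	public static void main(String[] args) {
		String[] names = { "en", "fr", "de" };
		// Best score above minSingleValue (0.75): prints [en]
		System.out.println(select(names, new double[] { 0.90, 0.06, 0.04 }, 0.75, 0.15, 3));
		// No clear winner, two candidates reach minMultipleValue (0.15): prints [en, fr]
		System.out.println(select(names, new double[] { 0.50, 0.40, 0.10 }, 0.75, 0.15, 3));
		// Same scores with the default maxNbValues of 1: prints [en]
		System.out.println(select(names, new double[] { 0.50, 0.40, 0.10 }, 0.75, 0.15, 1));
	}
}
```

With the default maxNbValues of 1, the intermediate branch never yields more than one language, so raising maxNbValues is what enables multi-language annotation.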