Motivation

Prior to version 3.2.0, there was no clear and consistent mechanism for processing content in different languages across all components in the pipeline and across different clustering algorithms. The approach of having one global active-language had a number of deficiencies:

Minus No support for mixed-language collections of documents. If the input collection of documents contained 50 documents in English and 50 documents in German, what active-language should we set?
Minus Inconsistencies between components. No clear solution for a case where one component (e.g. clustering algorithm) can handle language A set as active-language, but some other component in the pipeline (e.g. document source) cannot handle that language. Should we put some sort of reconciliation in place?
Minus Awkward code and hacks in places (core, workbench).

Solution

One solution to the above problems is to remove the global active-language attribute and replace it with a per-Document language field and a default-language attribute for clustering algorithms. Document sources or direct clustering callers will set the appropriate language for each document if possible. Otherwise, the clustering algorithm's default-language will apply.
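The fallback rule described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the Doc class, the shortened LanguageCode enum and the effectiveLanguage helper are hypothetical stand-ins, not the actual Carrot2 classes.

```java
public class EffectiveLanguageSketch {
    /** Hypothetical stand-in for LanguageCode; only a few constants are shown. */
    enum LanguageCode { ENGLISH, GERMAN, POLISH }

    /** Hypothetical minimal document: a title plus a nullable language. */
    static final class Doc {
        final String title;
        final LanguageCode language; // null means the language is unknown

        Doc(String title, LanguageCode language) {
            this.title = title;
            this.language = language;
        }
    }

    /** Returns the document's own language, or the algorithm's default when none is set. */
    static LanguageCode effectiveLanguage(Doc doc, LanguageCode defaultLanguage) {
        return doc.language != null ? doc.language : defaultLanguage;
    }

    public static void main(String[] args) {
        Doc tagged = new Doc("Datenbanken", LanguageCode.GERMAN);
        Doc untagged = new Doc("Data mining", null);

        // The per-document language wins over the default.
        System.out.println(effectiveLanguage(tagged, LanguageCode.ENGLISH));
        // Without a per-document language, the default applies.
        System.out.println(effectiveLanguage(untagged, LanguageCode.ENGLISH));
    }
}
```

Note that changing the default only affects documents whose language is null, which is the source of the potential confusion listed in the Discussion section.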

There is no need to put further constraints on how the clustering algorithms use the language field. One scenario we've used in the past is to divide the documents into subsets based on the language and then cluster each subset separately. The resulting clusters can be shown in one of three ways: mixed together in one flat list, as clusters for the major language plus an (Other Languages) group, or under a parent cluster for each language. Additionally, the clustering algorithm could set the cluster's assumed language in a dedicated attribute of the cluster. If the clustering algorithm cannot handle some language, it could log a warning and treat the affected documents as written in the default-language.
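A minimal sketch of the partitioning step, including the "warn and fall back" behaviour for unsupported languages, might look as follows. The Doc and Lang types, the partitionByLanguage method and the notion of a "supported" set are assumptions made for this illustration; they do not describe an existing API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LanguagePartitionSketch {
    /** Hypothetical stand-in for LanguageCode. */
    enum Lang { ENGLISH, GERMAN, POLISH }

    /** Hypothetical minimal document with a nullable language. */
    static final class Doc {
        final String title;
        final Lang language;

        Doc(String title, Lang language) {
            this.title = title;
            this.language = language;
        }
    }

    /**
     * Groups documents by language. Documents with no language, or with a language
     * the algorithm does not support, end up in the default-language partition.
     */
    static Map<Lang, List<Doc>> partitionByLanguage(
            List<Doc> docs, Set<Lang> supported, Lang defaultLanguage) {
        Map<Lang, List<Doc>> partitions = new LinkedHashMap<>();
        for (Doc doc : docs) {
            Lang lang = doc.language;
            if (lang == null) {
                lang = defaultLanguage;
            } else if (!supported.contains(lang)) {
                System.err.println("Warning: unsupported language " + lang
                        + ", assuming " + defaultLanguage);
                lang = defaultLanguage;
            }
            partitions.computeIfAbsent(lang, k -> new ArrayList<>()).add(doc);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("Data mining", Lang.ENGLISH),
                new Doc("Datenbanken", Lang.GERMAN),
                new Doc("Untagged", null),
                new Doc("Wyszukiwanie", Lang.POLISH));

        // Assume the algorithm supports English and German only.
        Map<Lang, List<Doc>> parts =
                partitionByLanguage(docs, EnumSet.of(Lang.ENGLISH, Lang.GERMAN), Lang.ENGLISH);

        // Each partition would then be clustered separately.
        for (Map.Entry<Lang, List<Doc>> e : parts.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size() + " document(s)");
        }
    }
}
```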

Use cases

  • Clustering directly provided documents: set the language for each document and let the clustering algorithm handle it.
  • Clustering documents from a document source with language attribute: set the source-specific language attribute(s); the source will then set the documents' language appropriately and the clustering algorithm will handle them.
  • Clustering documents from a document source without language settings: either set the algorithm's default-language to the desired language or introduce a simple component between the source and the algorithm that will set the language based on some criteria.
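The "simple component between the source and the algorithm" mentioned in the last use case could be as small as the sketch below. Everything here is hypothetical: the Doc class, the assignLanguages method, the overwrite flag and the toy URL-based criterion are assumptions chosen only to show the idea.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class LanguageSetterSketch {
    /** Hypothetical stand-in for LanguageCode. */
    enum Lang { ENGLISH, GERMAN }

    /** Hypothetical minimal document with a mutable, nullable language. */
    static final class Doc {
        final String url;
        Lang language; // null until some component sets it

        Doc(String url) {
            this.url = url;
        }
    }

    /**
     * Applies the criterion to documents that have no language yet. With overwrite
     * set to true, existing languages are replaced as well, which would emulate a
     * "force language" setting.
     */
    static void assignLanguages(List<Doc> docs, Function<Doc, Lang> criterion, boolean overwrite) {
        for (Doc doc : docs) {
            if (overwrite || doc.language == null) {
                doc.language = criterion.apply(doc);
            }
        }
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("http://example.de/a"),
                new Doc("http://example.com/b"));

        // Toy criterion: guess German from a .de host; return null when there is no guess,
        // so that the algorithm's default-language applies later.
        assignLanguages(docs, d -> d.url.contains(".de/") ? Lang.GERMAN : null, false);

        for (Doc d : docs) {
            System.out.println(d.url + " -> " + d.language);
        }
    }
}
```

Leaving the language as null when the criterion has no opinion keeps the algorithm's default-language as the single fallback.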

Discussion

Plus Natural support for clustering mixed-language collections of documents.
Plus Natural support for a language recognizer component we may want to introduce in the future.
Plus No language inconsistencies between components.

Minus Potential confusion: users may change the algorithm's default-language and see no difference in clustering results because all documents already have their own language, which overrides the default-language. In theory, we could introduce some sort of "force language" attribute in the algorithms, but this doesn't seem to be a frequent case and can always be implemented by inserting a language-setter component between the source and the algorithm.

Backward compatibility

Two backward-incompatible changes will be made:

  • Removing the active-language attribute and replacing it with the differently named default-language.
  • Moving LanguageCode from org.carrot2.text.linguistic to org.carrot2.core, which will make XML files with active-language saved from earlier versions incompatible with the new version.

Implementation

Required tasks

Core
  • Remove the global active-language attribute.
  • Define a language field in Document. The type of the field should be LanguageCode. A null value is allowed and means an unknown language or a language outside of the LanguageCode constants.
Document sources
  • carrot2-source-ambient: hardcode the language to English
  • carrot2-source-boss: set documents' language based on BossSearchService.languageAndRegion; for the news document source, use the document language returned by the API
  • carrot2-source-etools: set documents' language based on EToolsDocumentSource.language
  • carrot2-source-pubmed: hardcode the language to English
  • carrot2-source-yahoo: set the documents' language based on the language, region and country attributes
Clustering algorithms
  • Implement a utility for splitting a list of documents into subsets by language and then combining the resulting clusters into a unified list of clusters according to one of the (selectable) strategies:
    • All clusters on the same list
    • Clusters for the major language, plus an (Other languages) group for clusters of other languages
    • Dedicated parent cluster for each language present
  • Add the default-language attribute, which will determine the fallback language to assign to documents with language set to null.
  • Convert the existing algorithms to partition the input documents by language and cluster each partition separately
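The three combining strategies listed above could be sketched roughly as follows. The Cluster class, the Strategy enum, the merge method and the idea that the caller passes in the major language (for example, the language with the most documents) are all assumptions made for this illustration, not a description of the final utility.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterMergeSketch {
    /** Hypothetical stand-in for LanguageCode. */
    enum Lang { ENGLISH, GERMAN }

    /** Hypothetical names for the three strategies. */
    enum Strategy { FLAT, MAJOR_PLUS_OTHER, PARENT_PER_LANGUAGE }

    /** Hypothetical minimal cluster: a label and optional subclusters. */
    static final class Cluster {
        final String label;
        final List<Cluster> subclusters = new ArrayList<>();

        Cluster(String label) {
            this.label = label;
        }
    }

    /** Combines per-language cluster lists into one list according to the strategy. */
    static List<Cluster> merge(Map<Lang, List<Cluster>> byLanguage, Lang majorLanguage,
            Strategy strategy) {
        List<Cluster> result = new ArrayList<>();
        switch (strategy) {
            case FLAT: {
                // All clusters on the same list.
                for (List<Cluster> clusters : byLanguage.values()) {
                    result.addAll(clusters);
                }
                break;
            }
            case MAJOR_PLUS_OTHER: {
                // Major-language clusters at top level, everything else under one group.
                Cluster other = new Cluster("(Other languages)");
                for (Map.Entry<Lang, List<Cluster>> e : byLanguage.entrySet()) {
                    if (e.getKey() == majorLanguage) {
                        result.addAll(e.getValue());
                    } else {
                        other.subclusters.addAll(e.getValue());
                    }
                }
                if (!other.subclusters.isEmpty()) {
                    result.add(other);
                }
                break;
            }
            case PARENT_PER_LANGUAGE: {
                // One parent cluster per language present in the input.
                for (Map.Entry<Lang, List<Cluster>> e : byLanguage.entrySet()) {
                    Cluster parent = new Cluster(e.getKey().name());
                    parent.subclusters.addAll(e.getValue());
                    result.add(parent);
                }
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Lang, List<Cluster>> byLanguage = new LinkedHashMap<>();
        byLanguage.put(Lang.ENGLISH,
                Arrays.asList(new Cluster("Data mining"), new Cluster("Web search")));
        byLanguage.put(Lang.GERMAN, Arrays.asList(new Cluster("Datenbanken")));

        for (Strategy s : Strategy.values()) {
            List<Cluster> merged = merge(byLanguage, Lang.ENGLISH, s);
            System.out.println(s + ": " + merged.size() + " top-level cluster(s)");
        }
    }
}
```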
Workbench
  • Remove the hack around active-language; everything should work smoothly without hacks.
Example code
  • Update org.carrot2.examples.clustering.ClusteringNonEnglishContent to reflect the refactorings

Optional tasks

Document sources
  • carrot2-source-google: investigate how to pass the language to Google API (as this is a JSON protocol, this might be through an HTTP header), set documents' language based on the new attribute
  • carrot2-source-lucene: add language setting support to SimpleFieldMapper. We may also expose some simple interface for transforming field values to LanguageCode constants
  • carrot2-source-solr: similarly to carrot2-source-lucene, introduce a way to map some field to the document's language
Solr plugin
  • Add support for mapping a Solr document field to a Carrot2 document language field