Motivation
Prior to version 3.2.0, there was no clear and consistent mechanism for processing content in different languages across all components in the pipeline and across different clustering algorithms. The approach of having a single global active-language attribute had a number of deficiencies:
No support for mixed-language collections of documents. If the input collection of documents contained 50 documents in English and 50 documents in German, what active-language should we set?
Inconsistencies between components. No clear solution for a case where one component (e.g. clustering algorithm) can handle language A set as active-language, but some other component in the pipeline (e.g. document source) cannot handle that language. Should we put some sort of reconciliation in place?
Awkward code and hacks in places (core, workbench).
Solution
One solution to the above problems is to remove the global active-language attribute and replace it with a per-Document language field and default-language for clustering algorithms. Document sources or direct clustering callers will set the appropriate language for each document if possible. Otherwise, the clustering algorithm's default-language will apply.
There is no need to put further constraints on how the clustering algorithms use the language field. One scenario we've used in the past is to divide the documents into subsets based on the language and then cluster each subset separately. The resulting clusters can then be presented in one of three ways: mixed together in one flat list, as clusters for the major language plus an (Other Languages) group, or under a parent cluster for each language. Additionally, the clustering algorithm could set the cluster's assumed language in a dedicated attribute of the cluster. If the clustering algorithm cannot handle some language, it could log a warning and assume the affected documents are written in the default-language.
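The split-by-language scenario can be sketched as follows. This is only an illustration under stated assumptions: Lang and Doc are hypothetical stand-ins for LanguageCode and Document, and partitionByLanguage and its supported parameter are invented names, not part of the actual API.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.logging.Logger;

/** Illustrative sketch only; none of these types are the actual Carrot2 classes. */
final class LanguagePartitionSketch {
    private static final Logger LOG = Logger.getLogger("LanguagePartitionSketch");

    /** Hypothetical stand-in for LanguageCode. */
    enum Lang { ENGLISH, GERMAN, POLISH }

    /** Hypothetical stand-in for Document; a null language means "unknown". */
    static final class Doc {
        final String title;
        final Lang language;

        Doc(String title, Lang language) {
            this.title = title;
            this.language = language;
        }
    }

    /**
     * Groups documents by their effective language. Documents with no language,
     * or with a language the algorithm does not support, fall back to defaultLanguage.
     */
    static Map<Lang, List<Doc>> partitionByLanguage(
            List<Doc> docs, Set<Lang> supported, Lang defaultLanguage) {
        Objects.requireNonNull(defaultLanguage, "defaultLanguage");
        Map<Lang, List<Doc>> partitions = new EnumMap<>(Lang.class);
        for (Doc doc : docs) {
            Lang effective = doc.language;
            if (effective == null) {
                effective = defaultLanguage;
            } else if (!supported.contains(effective)) {
                LOG.warning("Unsupported language " + effective
                        + ", assuming " + defaultLanguage);
                effective = defaultLanguage;
            }
            partitions.computeIfAbsent(effective, k -> new ArrayList<>()).add(doc);
        }
        return partitions;
    }
}
```

Each resulting partition would then be clustered separately; how the per-language clusters are combined afterwards is left open here.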
Use cases
- Clustering directly provided documents: set the language for each document and let the clustering algorithm handle it.
- Clustering documents from a document source with language attribute: set the source-specific language attribute(s), the source will set the documents' language appropriately and the clustering algorithm will handle them.
- Clustering documents from a document source without language settings: either set the algorithm's default-language to the desired language or introduce a simple component between the source and the algorithm that will set the language based on some criteria.
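For the last use case, the "simple component" between the source and the algorithm might look roughly like the sketch below. All names here (Lang, Doc, setMissingLanguages) are hypothetical, and the criterion is passed in as a plain function because the text does not prescribe one.

```java
import java.util.List;
import java.util.function.Function;

/** Illustrative sketch only; none of these types are the actual Carrot2 classes. */
final class LanguageSetterSketch {
    /** Hypothetical stand-in for LanguageCode. */
    enum Lang { ENGLISH, GERMAN }

    /** Hypothetical stand-in for Document with a mutable, nullable language. */
    static final class Doc {
        final String text;
        Lang language; // null means "unknown"

        Doc(String text, Lang language) {
            this.text = text;
            this.language = language;
        }
    }

    /**
     * Assigns a language to every document that does not have one yet.
     * The criterion may itself return null, in which case the document
     * stays without a language and the algorithm's default applies later.
     */
    static void setMissingLanguages(List<Doc> docs, Function<Doc, Lang> criterion) {
        for (Doc doc : docs) {
            if (doc.language == null) {
                doc.language = criterion.apply(doc);
            }
        }
    }
}
```

A caller could supply any rule as the criterion, for example a lookup on the document's URL or a toy keyword check such as `d -> d.text.contains(" und ") ? Lang.GERMAN : null`.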
Discussion
Natural support for clustering mixed-language collections of documents.
Natural support for a language recognizer component we may want to introduce in the future.
No language inconsistencies between components.
Potential confusion: users may change the algorithm's default-language and see no difference in clustering results because all documents already have their language set, which overshadows the default-language. In theory, we could introduce some sort of "force language" attribute in the algorithms, but this doesn't seem to be a frequent case and can always be implemented by inserting a language-setter component between the source and the algorithm.
Backward compatibility
Two backward-incompatible changes will be made:
- Removing the active-language attribute and replacing it with the differently named default-language.
- Moving LanguageCode from org.carrot2.text.linguistic to org.carrot2.core, which will make the XMLs with active-language saved from earlier versions incompatible with the new version.
Implementation
Required tasks
Core
- Remove the global active-language attribute.
- Define a language field in Document. Type of the field should be LanguageCode. A null value is allowed and means unknown language or a language outside of the LanguageCode constants.
Document sources
- carrot2-source-ambient: hardcode the language to English
- carrot2-source-boss: set documents' language based on BossSearchService.languageAndRegion; for the news document source, use the document language returned by the API
- carrot2-source-etools: set documents' language based on EToolsDocumentSource.language
- carrot2-source-pubmed: hardcode the language to English
- carrot2-source-yahoo: set the documents' language based on the language, region and country attributes
Clustering algorithms
- Implement a utility for splitting a list of documents into subsets by language and then combining the resulting clusters into a unified list of clusters according to one of the (selectable) strategies:
- All clusters on the same list
- Clusters for the major language, (Other languages) group for clusters of other languages
- Dedicated parent cluster for all present languages
- Add the default-language attribute, which will determine the fallback language to assign to documents with language set to null.
- Convert the existing algorithms to partition the input documents by language and cluster each partition separately
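The combining half of the utility, with the three selectable strategies listed above, could be sketched like this. Everything in the block is hypothetical: the Cluster and Lang types, the Strategy constant names, and in particular the choice to treat the language with the most clustered documents as the "major" one, which the task list does not define.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; none of these types are the actual Carrot2 classes. */
final class ClusterMergeSketch {
    /** Hypothetical stand-in for LanguageCode. */
    enum Lang { ENGLISH, GERMAN, POLISH }

    /** Hypothetical names for the three strategies. */
    enum Strategy { FLAT, MAJOR_PLUS_OTHER, PARENT_PER_LANGUAGE }

    /** Hypothetical stand-in for a cluster: a label, a document count, subclusters. */
    static final class Cluster {
        final String label;
        final int documentCount;
        final List<Cluster> subclusters = new ArrayList<>();

        Cluster(String label, int documentCount) {
            this.label = label;
            this.documentCount = documentCount;
        }

        /** Documents in this cluster plus all nested subclusters. */
        int size() {
            int total = documentCount;
            for (Cluster c : subclusters) {
                total += c.size();
            }
            return total;
        }
    }

    /** Combines per-language cluster lists into one list according to the strategy. */
    static List<Cluster> merge(Map<Lang, List<Cluster>> byLanguage, Strategy strategy) {
        List<Cluster> result = new ArrayList<>();
        switch (strategy) {
            case FLAT:
                for (List<Cluster> clusters : byLanguage.values()) {
                    result.addAll(clusters);
                }
                break;
            case MAJOR_PLUS_OTHER: {
                Lang major = majorLanguage(byLanguage);
                if (major == null) {
                    break; // nothing to merge
                }
                result.addAll(byLanguage.get(major));
                Cluster other = new Cluster("(Other languages)", 0);
                for (Map.Entry<Lang, List<Cluster>> e : byLanguage.entrySet()) {
                    if (e.getKey() != major) {
                        other.subclusters.addAll(e.getValue());
                    }
                }
                if (!other.subclusters.isEmpty()) {
                    result.add(other);
                }
                break;
            }
            case PARENT_PER_LANGUAGE:
                for (Map.Entry<Lang, List<Cluster>> e : byLanguage.entrySet()) {
                    Cluster parent = new Cluster(e.getKey().name(), 0);
                    parent.subclusters.addAll(e.getValue());
                    result.add(parent);
                }
                break;
        }
        return result;
    }

    /** Assumption: the "major" language is the one whose clusters hold the most documents. */
    static Lang majorLanguage(Map<Lang, List<Cluster>> byLanguage) {
        Lang major = null;
        int best = -1;
        for (Map.Entry<Lang, List<Cluster>> e : byLanguage.entrySet()) {
            int total = 0;
            for (Cluster c : e.getValue()) {
                total += c.size();
            }
            if (total > best) {
                best = total;
                major = e.getKey();
            }
        }
        return major;
    }
}
```

Counting cluster sizes only approximates the number of documents per language when a document can belong to several clusters; a real utility might count the input documents of each partition instead.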
Workbench
- Remove the hack around active-language, everything should work smoothly without hacks.
Example code
- Update org.carrot2.examples.clustering.ClusteringNonEnglishContent to reflect the refactorings
Optional tasks
Document sources
- carrot2-source-google: investigate how to pass the language to Google API (as this is a JSON protocol, this might be through an HTTP header), set documents' language based on the new attribute
- carrot2-source-lucene: add language setting support to SimpleFieldMapper. We may also expose some simple interface for transforming field values to the LanguageCode constant
- carrot2-source-solr: similarly to carrot2-source-lucene, introduce a way to map some field to the document's language
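One possible shape for the "simple interface for transforming field values" is sketched below. It assumes, purely for illustration, that the index field stores two-letter ISO 639-1 codes; the Lang enum, its isoCode field and fromFieldValue are hypothetical names rather than the real LanguageCode API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch only; none of these types are the actual Carrot2 classes. */
final class LanguageFieldMapperSketch {
    /** Hypothetical stand-in for LanguageCode, keyed by an assumed ISO 639-1 code. */
    enum Lang {
        ENGLISH("en"), GERMAN("de"), POLISH("pl");

        final String isoCode;

        Lang(String isoCode) {
            this.isoCode = isoCode;
        }
    }

    private static final Map<String, Lang> BY_ISO_CODE = new HashMap<>();
    static {
        for (Lang lang : Lang.values()) {
            BY_ISO_CODE.put(lang.isoCode, lang);
        }
    }

    /**
     * Maps a raw field value to a language constant. Returns null when the
     * value is missing or does not match any known constant, mirroring the
     * "null means unknown language" convention of the document field.
     */
    static Lang fromFieldValue(String value) {
        if (value == null) {
            return null;
        }
        return BY_ISO_CODE.get(value.trim().toLowerCase(Locale.ROOT));
    }
}
```

Returning null for unrecognized values keeps the mapper consistent with the rule that such documents fall back to the algorithm's default-language.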
Solr plugin
- Add support for mapping a Solr document field to a Carrot2 document language field