Goal
Resolve the confusion and minor bugs around lexical resource loading and make resource loading coherent across all software based on the Carrot2 framework.
Related issues
Implementation
- Create a ResourceManager component. The ResourceManager would be responsible for loading all external resources, such as dictionaries, license files etc. It would maintain its own ResourceUtils instance with the following locators, in the order of lookup attempts:
  - resource-dir directory locator, if provided
  - working directory locator
  - context class loader locator
  - class-relative locator

- Expose an @Init attribute in ResourceManager for one additional resource locator added at the top of the default resource locators list. The locator can itself delegate to a cascade of other locators, if needed.
- Move the reload-resources attribute from DefaultLanguageModelFactory to ResourceManager.
- Remove the resource-path attribute. Issue a warning if a value for the attribute is provided.
- Initialize locators in the ResourceManager#init() method, which will have to be invoked explicitly from the enclosing IProcessingComponent after the component is initialized.
- Implement the resource-dir attribute. If resource-dir is null or an empty string, the file system directory lookup should not be performed. If resource-dir is not empty, a warning should be issued if the path does not denote an existing directory.
- If reload-resources is false, ResourceManager caches the requested IResource. If reload-resources is true, ResourceManager loads the resource once again and puts the new version in the cache. The cache is shared among all instances of ResourceManager.
- Remove all unnecessary references to ResourceUtils (e.g. in tests), replace with specific IResource (mostly ClassResource).
- Write useful log messages on the DEBUG level, listing all locators examined, both the failing ones and the matching one.
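The lookup order and the shared cache described in this list could be sketched roughly as follows. This is only an illustration under stated assumptions: Locator and ResourceManagerSketch are hypothetical names, resources are reduced to plain strings, and none of this reproduces the actual Carrot2 ResourceUtils or IResource types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a resource locator; not the Carrot2 interface. */
interface Locator {
    Optional<String> locate(String name);
}

/** Sketch of the proposed manager: ordered locators plus a cache shared by all instances. */
final class ResourceManagerSketch {
    // Static, so every instance sees the same cached resources.
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    private final List<Locator> locators = new ArrayList<>();
    private final boolean reloadResources;

    ResourceManagerSketch(boolean reloadResources, Locator extraLocator, List<Locator> defaults) {
        this.reloadResources = reloadResources;
        if (extraLocator != null) {
            locators.add(extraLocator); // consulted before the default locators
        }
        locators.addAll(defaults);
    }

    Optional<String> get(String name) {
        if (!reloadResources) {
            String cached = CACHE.get(name);
            if (cached != null) {
                return Optional.of(cached);
            }
        }
        for (Locator locator : locators) {
            Optional<String> found = locator.locate(name);
            if (found.isPresent()) {
                CACHE.put(name, found.get()); // a reload also refreshes the shared cache
                return found;
            }
        }
        return Optional.empty();
    }
}
```

With reloadResources set to false, the second request for the same name never reaches the locators. With it set to true, the locators are consulted again and the refreshed value becomes visible to every other instance.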
Implications
- Removal of the resource-path attribute. The change needs to be documented in the release notes.
- Lingo3G's /resources moved to /. Build scripts need to be updated to put the resources at the new location, and the change needs to be documented in the release notes.
- A single ResourceManager for all components of an algorithm. The ResourceManager instance, created in the main algorithm class (e.g. LingoClusteringAlgorithm) or some parent component (e.g. CompletePreprocessingPipeline), will have to be passed to the appropriate subcomponents (e.g. the DefaultLanguageModelFactory). It would be tempting to put the ResourceManager in the preprocessing context, but then we'd have to put the reload-resources and resource-dir attributes somewhere.
- If ResourceManager is separate from the component that parses the resources it delivers, the component will have to check the reload-resources flag on its own and re-parse the resource appropriately.
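The last implication can be illustrated with a small sketch of a parsing component that honours the reload flag itself. StopWordSetSketch is a hypothetical name, and the Supplier stands in for whatever call would fetch the raw resource from the ResourceManager.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical component that parses a whitespace-separated stop word resource. */
final class StopWordSetSketch {
    private final Supplier<String> rawResource; // stands in for a ResourceManager lookup
    private final boolean reloadResources;
    private Set<String> parsed;

    StopWordSetSketch(Supplier<String> rawResource, boolean reloadResources) {
        this.rawResource = rawResource;
        this.reloadResources = reloadResources;
    }

    boolean isStopWord(String word) {
        // Parse on first use, and again on every call when reloading is requested.
        if (parsed == null || reloadResources) {
            parsed = new HashSet<>(Arrays.asList(rawResource.get().trim().split("\\s+")));
        }
        return parsed.contains(word);
    }
}
```

Re-parsing on every call keeps the sketch short. A real component would more likely re-parse once per processing request.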
Questions
- How do we implement Lingo3G license loading (home directory)? Use a dedicated ResourceUtils that will first check the license-specific locations (system property, home directory) and, if the license is not found, delegate the search to the algorithm's ResourceManager instance.
- The fact that Lingo3G is using only the stemmer/tokenizer part of DefaultLanguageModelFactory is a bit of a problem because it causes Carrot2's lexical resources to be loaded in vain and can cause errors if they're not found. One important constraint here is that Solr's clustering component overrides the default language model factory, so both Carrot2 algorithms and Lingo3G need to obtain stemmers and tokenizers from the overridden factory and not call Carrot2's default factories directly. One way to solve the problem would be to split the ILanguageModel factory into two interfaces and have Lingo3G use only one of them. Alternatively, we could remove the isStopWord and isStopLabel methods altogether and move the implementation to the places where they are needed. If we follow the latter approach, we'd still need to expose some interfaces for the methods, so that e.g. the Carrot2 Solr clustering component can use Solr's lexical resources to implement them. Yet another solution would be to make the DefaultLanguageModelFactory#createStemmer() and DefaultLanguageModelFactory#createTokenizer() methods public. All of the above solutions will require an update to Solr's Carrot2 clustering component.
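A rough sketch of the license lookup cascade from the first question is shown below. It assumes the property name, home directory and file name are passed in by the caller, because the actual Lingo3G values are not given here. LicenseLookupSketch is a hypothetical name, and the fallback function stands in for the algorithm's ResourceManager.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical license lookup: system property, then home directory, then a fallback. */
final class LicenseLookupSketch {
    static Optional<Path> find(String propertyName, Path homeDir, String fileName,
                               Function<String, Optional<Path>> fallback) {
        // 1. An explicit location given through a system property.
        String fromProperty = System.getProperty(propertyName);
        if (fromProperty != null) {
            Path candidate = Paths.get(fromProperty);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        // 2. A file with the expected name in the given home directory.
        Path inHome = homeDir.resolve(fileName);
        if (Files.isRegularFile(inHome)) {
            return Optional.of(inHome);
        }
        // 3. Delegate to the general-purpose lookup.
        return fallback.apply(fileName);
    }
}
```

A caller would typically pass Paths.get(System.getProperty("user.home")) as the home directory.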
Alternatives
- Should we rename the resource-path attribute to something more accurate, e.g. resource-classpath-prefix? This is a little tricky because, ideally, we should keep the old attribute for some time too. In fact, now I'm thinking we could add @Deprecated support to AttributeBinder, so that it could issue warnings if such an attribute is bound.
- As an alternative to the above, we could assume that resource-path denotes a filesystem path (not sure if anyone used it as a classpath; the only use I'm aware of was the misuse as a filesystem path) and not support prefixing of the resource locations. If someone would like to read them from a specific classpath location, then they would have to add a custom locator.
- What should reload-resources actually do? Should it reload the static resource cache (visible to all components) or simply return a freshly loaded resource (visible only to one component)?
- Implementing the extra locator passed as an @Init attribute would not be elegant if we don't modify the AttributeBinder somehow to let the bindable do something after an attribute is bound. Two solutions that spring to my mind would be: 1) adding support for setting attributes by a setter, 2) designating an after-bound method the AttributeBinder would call. We may also want to add some simple JUnitBenchmark tests for the AttributeBinder to make sure the overhead doesn't grow too much. Yet another approach would be explicit initialization invoked in the IProcessingComponent#init() on the specific components that would then initialize the resource manager.
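Two of the ideas above, warning when a deprecated attribute is bound and calling a designated after-bound method, could look roughly like this. It is a minimal reflection-based sketch: the @Attr and @AfterBound annotations and the BinderSketch class are hypothetical and are not the real AttributeBinder API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical attribute marker; Carrot2's own annotations are not reproduced here. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Attr {
    String key();
}

/** Hypothetical marker for a no-argument method to call once binding is done. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface AfterBound {
}

final class BinderSketch {
    /** Binds values to @Attr fields; returns warnings for deprecated attributes that got a value. */
    static List<String> bind(Object target, Map<String, Object> values) {
        List<String> warnings = new ArrayList<>();
        try {
            for (Field field : target.getClass().getDeclaredFields()) {
                Attr attr = field.getAnnotation(Attr.class);
                if (attr == null || !values.containsKey(attr.key())) {
                    continue;
                }
                if (field.isAnnotationPresent(Deprecated.class)) {
                    warnings.add("Attribute '" + attr.key() + "' is deprecated");
                }
                field.setAccessible(true);
                field.set(target, values.get(attr.key()));
            }
            // Give the bindable a chance to react once all attributes are set.
            for (Method method : target.getClass().getDeclaredMethods()) {
                if (method.isAnnotationPresent(AfterBound.class)) {
                    method.setAccessible(true);
                    method.invoke(target);
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Binding failed", e);
        }
        return warnings;
    }
}

/** Example bindable: one deprecated attribute, one live attribute, one after-bound hook. */
final class BindableSketch {
    @Deprecated
    @Attr(key = "resource-path")
    String resourcePath;

    @Attr(key = "resource-dir")
    String resourceDir;

    boolean initialized;

    @AfterBound
    void afterBound() {
        initialized = resourceDir != null;
    }
}
```

The hook runs once per bind call, after all fields are set, so a resource manager could build its locator list there. The reflection scan on every bind is the kind of overhead the benchmark tests mentioned above would need to watch.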
The linguistic branch
I used many of the ideas from this document while working on refactoring of the linguistic stuff. In the end it seems more natural to extract the lexical data component (ILexicalData) that manages resource-reloading appropriately (instead of designing another resource manager). The benefit from this is that resource reloading triggers reloading of the post-processed internal structure of linguistic resources and that this post-processed internal structure can be more naturally shared between all interested users of ILexicalDataFactory. Out of many different design strategies, this one appealed to me most. Another potential benefit I see is that it allows for easy refactoring of ResourceUtils (so that resource utils are passed as part of the context – during attribute binding). I will keep working on this on the branch.
2 Comments
Oct 07, 2010
Dawid Weiss
The problem I see is that a ResourceManager created on @Init (and its associated resources) will be duplicated if a pooling controller is used. In C#, the initialization (loading, parsing and initial processing) of resources is costly, so doing that multiple times (for whatever size pool one has) is a no-no. The same applies to Java: the current pool uses weak refs, so if memory runs low and components are constantly released from the pool, things will get gritty. And of course multiple ResourceManagers result in more memory being consumed by duplicated resources.
Oct 07, 2010
Stanisław Osiński
Good point. I've added an explanation that the cache is to be shared across all instances of ResourceManager.