Resource management clean-up

Goal

Resolve the confusion and minor bugs around lexical resource loading, and make loading behave consistently across all software based on the Carrot2 framework.

Related issues

Implementation

  1. Create a ResourceManager component. The ResourceManager would be responsible for loading all external resources, such as dictionaries, license files etc. It would maintain its own ResourceUtils instance with the following locators, in the order of look up attempts:
    • resource-dir directory locator, if provided
    • working directory locator
    • context class loader locator
    • class-relative locator (open question)
  2. Expose an @Init attribute in ResourceManager for one additional resource locator added at the top of the default resource locators list. The locator can itself delegate to a cascade of other locators, if needed.
  3. Move the reload-resources attribute from DefaultLanguageModelFactory to ResourceManager.
  4. Remove the resource-path attribute. Issue a warning if a value for the attribute is provided.
  5. Initialize locators in the ResourceManager#init() method, which will have to be invoked explicitly from the enclosing IProcessingComponent after the component is initialized.
  6. Implement the resource-dir attribute. If resource-dir is null or an empty string, the file system directory lookup should not be performed. If resource-dir is not empty, a warning should be issued if the path does not denote an existing directory.
  7. If reload-resources is false, ResourceManager caches the requested IResource. If reload-resources is true, ResourceManager loads the resource again and puts the new version in the cache. The cache is shared among all instances of ResourceManager.
  8. Remove all unnecessary references to ResourceUtils (e.g. in tests) and replace them with a specific IResource (mostly ClassResource; open question).
  9. Write useful log messages at the DEBUG level, listing all locators examined: both the failing ones and the matching one.
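
The lookup order, the extra locator from step 2, and the caching rule from step 7 can be sketched as follows. This is only an illustration under stated assumptions: Locator, MapLocator and ResourceManagerSketch are hypothetical names, resources are plain strings instead of IResource, and a list stands in for the DEBUG log so the example has no logging dependency.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical locator abstraction; not the real Carrot2 interface. */
interface Locator {
    String name();

    /** Returns the resource content, or empty if this locator cannot find it. */
    Optional<String> locate(String resource);
}

/** Hypothetical locator backed by a map, standing in for a directory or a class loader. */
final class MapLocator implements Locator {
    private final String name;
    private final Map<String, String> content;

    MapLocator(String name, Map<String, String> content) {
        this.name = name;
        this.content = content;
    }

    @Override public String name() { return name; }

    @Override public Optional<String> locate(String resource) {
        return Optional.ofNullable(content.get(resource));
    }
}

/** Hypothetical sketch of the proposed ResourceManager; all signatures are assumptions. */
final class ResourceManagerSketch {
    // Shared among all instances, as step 7 proposes.
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    private final List<Locator> locators = new ArrayList<>();
    private final boolean reloadResources;

    // Stand-in for DEBUG-level logging (step 9).
    final List<String> debugLog = new ArrayList<>();

    ResourceManagerSketch(Locator extra, List<Locator> defaults, boolean reloadResources) {
        if (extra != null) {
            locators.add(extra); // step 2: the extra locator goes first
        }
        locators.addAll(defaults);
        this.reloadResources = reloadResources;
    }

    Optional<String> get(String resource) {
        if (!reloadResources) {
            String cached = CACHE.get(resource);
            if (cached != null) {
                debugLog.add("cache: found " + resource);
                return Optional.of(cached);
            }
        }
        // Try each locator in order; the first match wins and refreshes the shared cache.
        for (Locator locator : locators) {
            Optional<String> found = locator.locate(resource);
            debugLog.add(locator.name() + (found.isPresent() ? ": found " : ": not found ") + resource);
            if (found.isPresent()) {
                CACHE.put(resource, found.get());
                return found;
            }
        }
        return Optional.empty();
    }
}
```

In this sketch an instance with reload-resources set to true refreshes the shared cache, so instances with reload-resources set to false see the new version on their next request. Whether that is the desired behaviour is one of the open points listed under Alternatives.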

Implications

  1. Removal of the resource-path attribute. The change needs to be documented in the release notes.
  2. Lingo3G's /resources moved to /. Build scripts need to be updated to put the resources at the new location, and the change needs to be documented in the release notes.
  3. A single ResourceManager for all components of an algorithm. The ResourceManager instance, created in the main algorithm class (e.g. LingoClusteringAlgorithm) or some parent component (e.g. CompletePreprocessingPipeline), will have to be passed to the appropriate subcomponents (e.g. the DefaultLanguageModelFactory). It would be tempting to put the ResourceManager in the preprocessing context, but then we'd have to put the reload-resources and resource-dir attributes somewhere.
  4. If ResourceManager is separate from the component that parses the resources it delivers, the component will have to check the reload-resources flag on its own and re-parse the resource appropriately.

Questions

  • How do we implement Lingo3G license loading (home directory)? Use a dedicated ResourceUtils that will first check the license-specific locations (system property, home directory) and, if the license is not found, delegate the search to the algorithm's ResourceManager instance.
  • The fact that Lingo3G uses only the stemmer/tokenizer part of DefaultLanguageModelFactory is a bit of a problem because it causes Carrot2's lexical resources to be loaded in vain and can cause errors if they're not found. One important constraint here is that Solr's clustering component overrides the default language model factory, so both Carrot2 algorithms and Lingo3G need to obtain stemmers and tokenizers from the overridden factory and not call Carrot2's default factories directly. One way to solve the problem would be to split the ILanguageModel factory into two interfaces and have Lingo3G use only one of them. Alternatively, we could remove isStopWord and isStopLabel altogether and move the implementation to the places where they are needed. If we follow the latter approach, we'd still need to expose some interfaces for the methods, so that e.g. the Carrot2 Solr clustering component can use Solr's lexical resources to implement them. Yet another solution would be to make the DefaultLanguageModelFactory#createStemmer() and DefaultLanguageModelFactory#createTokenizer() methods public. All of the above solutions will require an update to Solr's Carrot2 clustering component.
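
The first option, splitting the factory into two interfaces, could look roughly like the sketch below. All type names here (TokenProcessingFactory, LexicalFilter, StemOnlyConsumer and the rest) are hypothetical and the stemmer and tokenizer are deliberately trivial; the point is only that a consumer such as Lingo3G would depend on the half that needs no lexical resources.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

interface StemmerSketch { String stem(String word); }

interface TokenizerSketch { List<String> tokenize(String text); }

/** Hypothetical first half of the split: needs no lexical resources. */
interface TokenProcessingFactory {
    StemmerSketch createStemmer();
    TokenizerSketch createTokenizer();
}

/** Hypothetical second half: backed by stop word and stop label resources. */
interface LexicalFilter {
    boolean isStopWord(String word);
    boolean isStopLabel(String label);
}

/** Toy implementation of the token-processing half. */
final class SimpleTokenProcessing implements TokenProcessingFactory {
    @Override public StemmerSketch createStemmer() {
        // Toy stemmer: lower-cases and strips one trailing "s" from longer words.
        return word -> {
            String lower = word.toLowerCase(Locale.ROOT);
            return lower.length() > 3 && lower.endsWith("s")
                ? lower.substring(0, lower.length() - 1)
                : lower;
        };
    }

    @Override public TokenizerSketch createTokenizer() {
        // Toy tokenizer: splits on whitespace and drops empty tokens.
        return text -> {
            List<String> tokens = new ArrayList<>();
            for (String token : text.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
            return tokens;
        };
    }
}

/** Toy implementation of the lexical half, backed by in-memory sets. */
final class SetBackedLexicalFilter implements LexicalFilter {
    private final Set<String> stopWords;
    private final Set<String> stopLabels;

    SetBackedLexicalFilter(Set<String> stopWords, Set<String> stopLabels) {
        this.stopWords = stopWords;
        this.stopLabels = stopLabels;
    }

    @Override public boolean isStopWord(String word) {
        return stopWords.contains(word.toLowerCase(Locale.ROOT));
    }

    @Override public boolean isStopLabel(String label) {
        return stopLabels.contains(label.toLowerCase(Locale.ROOT));
    }
}

/** A consumer that depends only on the token-processing half. */
final class StemOnlyConsumer {
    private final TokenProcessingFactory factory;

    StemOnlyConsumer(TokenProcessingFactory factory) {
        this.factory = factory;
    }

    List<String> stems(String text) {
        StemmerSketch stemmer = factory.createStemmer();
        List<String> result = new ArrayList<>();
        for (String token : factory.createTokenizer().tokenize(text)) {
            result.add(stemmer.stem(token));
        }
        return result;
    }
}
```

With this shape, an overriding component (e.g. the Solr one) could supply its own implementation of either interface, and a consumer of TokenProcessingFactory would never trigger loading of stop word or stop label resources.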

Alternatives

  • Should we rename the resource-path attribute to something more accurate, e.g. resource-classpath-prefix? This is a little tricky because, ideally, we should keep the old attribute for some time too. In fact, now I'm thinking we could add @Deprecated support to AttributeBinder, so that it could issue warnings if such an attribute is bound.
  • As an alternative to the above, we could assume the resource-path denotes a filesystem path (not sure if anyone used it as a classpath; the only use I'm aware of was the misuse as a filesystem path) and not support prefixing of the resource locations. If someone would like to read them from a specific classpath location, then they would have to add a custom locator.
  • What should reload-resources actually do? Should it reload the static resource cache (visible to all components) or simply return a freshly loaded resource (visible only to one component)?
  • Implementing the extra locator passed as an @Init attribute would not be elegant if we don't modify the AttributeBinder somehow to let the bindable do something after an attribute is bound. Two solutions that spring to my mind would be: 1) adding support for setting attributes by a setter, 2) designating an after-bound method the AttributeBinder would call. We may also want to add some simple JUnitBenchmark tests for the AttributeBinder to make sure the overhead doesn't grow too much. Yet another approach would be explicit initialization invoked in the IProcessingComponent#init() on the specific components that would then initialize the resource manager.
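
The second idea in the last bullet, an after-bound method the binder calls once attributes are set, can be sketched with plain reflection. MiniBinder, @Attr and @AfterBound are hypothetical names invented for this example; this is not the AttributeBinder API, only an illustration of the callback mechanism.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical marker for a bindable field. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Attr {
    String key();
}

/** Hypothetical marker for a method to call once binding is complete. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface AfterBound {
}

/** Hypothetical, heavily simplified binder; not the real AttributeBinder. */
final class MiniBinder {
    static void bind(Object target, Map<String, Object> values) {
        Class<?> type = target.getClass();
        try {
            // 1) Set every annotated field for which a value was provided.
            for (Field field : type.getDeclaredFields()) {
                Attr attr = field.getAnnotation(Attr.class);
                if (attr != null && values.containsKey(attr.key())) {
                    field.setAccessible(true);
                    field.set(target, values.get(attr.key()));
                }
            }
            // 2) Invoke the after-bound callbacks so the target can finish its setup.
            for (Method method : type.getDeclaredMethods()) {
                if (method.isAnnotationPresent(AfterBound.class)) {
                    method.setAccessible(true);
                    method.invoke(target);
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Binding failed for " + type.getName(), e);
        }
    }
}

/** Example bindable: rebuilds its locator list after the extra locator is bound. */
final class ManagerWithExtraLocator {
    @Attr(key = "extra-locator")
    private String extraLocator; // stays null when no value is provided

    final List<String> locators = new ArrayList<>();

    @AfterBound
    private void initLocators() {
        locators.clear();
        if (extraLocator != null) {
            locators.add(extraLocator); // the extra locator goes first
        }
        locators.add("resource-dir");
        locators.add("working-dir");
        locators.add("context-class-loader");
        locators.add("class-relative");
    }
}
```

The callback runs on every bind, so the overhead is one extra reflective scan per bound object; that is the kind of cost the JUnitBenchmark tests mentioned above would need to measure.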

The linguistic branch

I used many of the ideas from this document while working on refactoring of the linguistic stuff. In the end it seems more natural to extract the lexical data component (ILexicalData) that manages resource-reloading appropriately (instead of designing another resource manager). The benefit from this is that resource reloading triggers reloading of the post-processed internal structure of linguistic resources and that this post-processed internal structure can be more naturally shared between all interested users of ILexicalDataFactory. Out of many different design strategies, this one appealed to me most. Another potential benefit I see is that it allows for easy refactoring of ResourceUtils (so that resource utils are passed as part of the context – during attribute binding). I will keep working on this on the branch.
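
To illustrate the idea, here is a minimal sketch of a lexical data factory that shares the post-processed structure between users and rebuilds it on reload. LexicalDataSketch and LexicalDataFactorySketch are hypothetical names, and the one-word-per-line format with # comments is an assumption made for the example; the actual ILexicalData and ILexicalDataFactory on the branch may differ.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical, reduced view of lexical data. */
interface LexicalDataSketch {
    boolean isStopWord(String word);
}

/** Hypothetical factory that owns both resource loading and the parsed structure. */
final class LexicalDataFactorySketch {
    // Parsed (post-processed) stop word sets, shared between all factory instances.
    private static final Map<String, Set<String>> PARSED = new ConcurrentHashMap<>();

    private final Function<String, String> rawLoader; // language -> raw resource text
    private final boolean reloadResources;

    LexicalDataFactorySketch(Function<String, String> rawLoader, boolean reloadResources) {
        this.rawLoader = rawLoader;
        this.reloadResources = reloadResources;
    }

    LexicalDataSketch get(String language) {
        final Set<String> words;
        if (reloadResources) {
            // Reloading re-parses the raw resource and replaces the shared structure.
            words = parse(rawLoader.apply(language));
            PARSED.put(language, words);
        } else {
            words = PARSED.computeIfAbsent(language, key -> parse(rawLoader.apply(key)));
        }
        return word -> words.contains(word.toLowerCase(Locale.ROOT));
    }

    /** Assumed format: one word per line, blank lines and lines starting with # ignored. */
    private static Set<String> parse(String raw) {
        Set<String> words = new HashSet<>();
        if (raw != null) {
            for (String line : raw.split("\n")) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty() && !word.startsWith("#")) {
                    words.add(word);
                }
            }
        }
        return Collections.unmodifiableSet(words);
    }
}
```

Because the factory holds the parsed sets rather than raw resources, a reload cannot leave a stale parsed structure behind, which is the problem a separate resource manager would push onto each consuming component.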
