Resolve the confusion and minor bugs around lexical resource loading, and make resource loading coherent across all software based on the Carrot2 framework.
- Create a ResourceManager that would be responsible for loading all external resources, such as dictionaries, license files etc. It would maintain its own ResourceUtils instance with the following locators, in the order of look up attempts:
  - resource-dir directory locator, if provided
  - working directory locator
  - context class loader locator
  - class-relative locator
- Expose an @Init attribute in ResourceManager for one additional resource locator added at the top of the default resource locators list. The locator can itself delegate to a cascade of other locators, if needed.
- Move the
- Remove the resource-path attribute. Issue a warning if a value for the attribute is provided.
- Initialize locators in the ResourceManager#init() method, which will have to be invoked explicitly from the enclosing IProcessingComponent after the component is initialized.
- Implement the resource-dir attribute. If its value is null or an empty string, the file system directory look up should not be performed. If resource-dir is not empty, a warning should be issued if the path does not denote an existing directory.
- ResourceManager caches the requested
ResourceManager loads the resource once again and puts the new version in the cache. The cache is shared among all instances of
- Remove all unnecessary references to ResourceUtils (e.g. in tests), replace with specific
- Write useful log messages on the DEBUG level, listing all locators examined, both the failing ones and the matching one.
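The lookup order and the DEBUG logging described in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: Locator, MapLocator and LocatorCascade are hypothetical names, not the actual ResourceUtils or locator API of Carrot2, and the in-memory locators merely stand in for the directory and class loader locators. The FINE level of java.util.logging is used as a stand-in for DEBUG.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical locator abstraction (not the real Carrot2 interface). */
interface Locator {
    String name();

    /** Returns the resource content, or empty if this locator cannot find it. */
    Optional<byte[]> load(String resource);
}

/** In-memory locator standing in for a directory or class loader locator. */
final class MapLocator implements Locator {
    private final String name;
    private final Map<String, String> content;

    MapLocator(String name, Map<String, String> content) {
        this.name = name;
        this.content = content;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public Optional<byte[]> load(String resource) {
        return Optional.ofNullable(content.get(resource))
            .map(text -> text.getBytes(StandardCharsets.UTF_8));
    }
}

/** Tries locators in order and logs every locator examined on the FINE level. */
final class LocatorCascade {
    private static final Logger LOG = Logger.getLogger(LocatorCascade.class.getName());

    private final List<Locator> locators;

    LocatorCascade(List<Locator> locators) {
        this.locators = new ArrayList<>(locators);
    }

    /** Returns the content from the first locator that has the resource. */
    Optional<byte[]> load(String resource) {
        for (Locator locator : locators) {
            Optional<byte[]> found = locator.load(resource);
            if (found.isPresent()) {
                LOG.log(Level.FINE, "Resource {0}: found by {1}",
                    new Object[] {resource, locator.name()});
                return found;
            }
            LOG.log(Level.FINE, "Resource {0}: not found by {1}",
                new Object[] {resource, locator.name()});
        }
        return Optional.empty();
    }
}
```

An extra locator supplied by the caller would simply be placed at the head of the list passed to the cascade, ahead of the default locators.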
- Removal of the resource-path attribute. The change needs to be documented in the release notes.
/. Build scripts need to be updated to put the resources at the new location, and the change needs to be documented in the release notes.
- A single ResourceManager for all components of an algorithm. The ResourceManager instance, created in the main algorithm class (e.g. LingoClusteringAlgorithm) or some parent component (e.g. CompletePreprocessingPipeline), will have to be passed to the appropriate subcomponents (e.g. the DefaultLanguageModelFactory). It would be tempting to put the ResourceManager in the preprocessing context, but then we'd have to put the
ResourceManager is separate from the component that parses the resources it delivers, the component will have to check the reload-resources flag on its own and re-parse the resource appropriately.
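The caching and reloading behaviour mentioned above could look roughly like the sketch below. It assumes one particular reading of reload-resources, namely that a reload refreshes the cache shared by all instances. CachingResourceManager, its loader function and the string-valued resources are hypothetical simplifications, not existing Carrot2 classes.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical manager with a cache shared among all of its instances. */
final class CachingResourceManager {
    // Static on purpose: every instance sees the same cached resources.
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    // Stands in for the locator lookup; assumed never to return null.
    private final Function<String, String> loader;

    CachingResourceManager(Function<String, String> loader) {
        this.loader = Objects.requireNonNull(loader);
    }

    /**
     * Returns the cached resource, or loads it again and replaces the
     * cached version when reloadResources is true.
     */
    String get(String resource, boolean reloadResources) {
        if (reloadResources) {
            String fresh = Objects.requireNonNull(loader.apply(resource));
            CACHE.put(resource, fresh);
            return fresh;
        }
        return CACHE.computeIfAbsent(resource, loader);
    }
}
```

Because the manager only hands out raw resources here, a component that keeps a parsed form would still have to notice the reload-resources flag itself and re-parse what it receives.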
- How do we implement Lingo3G license loading (home directory)? Use a dedicated ResourceUtils that will first check the license-specific locations (system property, home directory) and, if the license is not found, delegate the search to the algorithm's
- The fact that Lingo3G is using only the stemmer/tokenizer part of DefaultLanguageModelFactory is a bit of a problem because it causes Carrot2's lexical resources to be loaded in vain and can cause errors if they're not found. One important constraint here is that Solr's clustering component overrides the default language model factory, so both Carrot2 algorithms and Lingo3G need to obtain stemmers and tokenizers from the overridden factory and not call Carrot2's default factories directly. One way to solve the problem would be to split the ILanguageModel factory into two interfaces and have Lingo3G use only one of them. Alternatively, we could remove the isStopLabel altogether and move the implementation to the places where they are needed. If we follow the latter approach, we'd still need to expose some interfaces for the methods, so that e.g. Carrot2 Solr clustering component can use Solr's lexical resources to implement them. Yet another solution would be to make the DefaultLanguageModelFactory#createTokenizer() method public. All of the above solutions will require an update to Solr's Carrot2 clustering component.
- Should we rename the resource-path attribute to something more accurate, e.g. resource-classpath-prefix? This is a little tricky because, ideally, we should keep the old attribute for some time too. In fact, now I'm thinking we could add
AttributeBinder, so that it could issue warnings if such an attribute is bound.
- As an alternative to the above, we could assume the resource-path denotes a filesystem path (not sure if anyone used it as a classpath; the only use I'm aware of was the misuse as a filesystem path) and not support prefixing of the resource locations. If someone would like to read them from a specific classpath location, then they would have to add a custom locator.
- What should reload-resources actually do? Should it reload the static resource cache (visible for all components) or simply return a freshly loaded resource (visible only to one component)?
- Implementing the extra locator passed as an @Init attribute would not be elegant if we don't modify the AttributeBinder somehow to let the bindable do something after an attribute is bound. Two solutions that spring to my mind would be: 1) adding support for setting attributes by a setter, 2) designating an after-bound method the AttributeBinder would call. We may also want to add some simple JUnitBenchmark tests for the AttributeBinder to make sure the overhead doesn't grow too much. Yet another approach would be explicit initialization invoked in the IProcessingComponent#init() on the specific components that would then initialize the resource manager.
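The second option above, an after-bound method called by the binder, might be sketched as follows. This is not the real AttributeBinder: SimpleBinder, the AfterBound annotation and ExampleComponent are hypothetical names invented for the illustration, and binding is reduced to setting fields by name from a map.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.Map;

/** Hypothetical marker for a method to call once all values are bound. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface AfterBound {}

/** Toy binder: sets fields by name, then invokes @AfterBound methods. */
final class SimpleBinder {
    static void bind(Object target, Map<String, Object> values) {
        Class<?> type = target.getClass();
        try {
            for (Map.Entry<String, Object> entry : values.entrySet()) {
                Field field = type.getDeclaredField(entry.getKey());
                field.setAccessible(true);
                field.set(target, entry.getValue());
            }
            // Give the bindable a chance to react after binding is complete.
            for (Method method : type.getDeclaredMethods()) {
                if (method.isAnnotationPresent(AfterBound.class)) {
                    method.setAccessible(true);
                    method.invoke(target);
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Binding failed", e);
        }
    }
}

/** Example bindable; extraLocator stands in for an @Init attribute. */
final class ExampleComponent {
    String extraLocator;
    String effectiveLocators; // derived state, built after binding

    @AfterBound
    void initLocators() {
        String prefix = extraLocator == null ? "" : extraLocator + ", ";
        effectiveLocators = prefix + "defaults";
    }
}
```

The reflective scan over methods on every bind is the kind of overhead the benchmark tests mentioned above would need to watch.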
I used many of the ideas from this document while working on refactoring of the linguistic stuff. In the end it seems more natural to extract the lexical data component (ILexicalData) that manages resource-reloading appropriately (instead of designing another resource manager). The benefit from this is that resource reloading triggers reloading of the post-processed internal structure of linguistic resources and that this post-processed internal structure can be more naturally shared between all interested users of ILexicalDataFactory. Out of many different design strategies, this one appealed to me most. Another potential benefit I see is that it allows for easy refactoring of ResourceUtils (so that resource utils are passed as part of the context – during attribute binding). I will keep working on this on the branch.