All Implemented Interfaces:
IMatcher<org.apache.jena.ontology.OntModel,​Alignment,​Properties>, eu.sealsproject.platform.res.domain.omt.IOntologyMatchingToolBridge, eu.sealsproject.platform.res.tool.api.IPlugin, eu.sealsproject.platform.res.tool.api.IToolBridge

public class StopwordExtraction
extends MatcherYAAAJena
Extracts corpus-dependent stopwords from instances, classes, and properties.
  • Field Details

    • LOGGER

      private static final org.slf4j.Logger LOGGER
    • valueExtractors

      private List<TextExtractor> valueExtractors
      Literal extractors that select which literals/properties are used.
    • tokenizer

      private Function<String,​Collection<String>> tokenizer
      Tokenizer function that splits a text into a collection of tokens.
    • countDistinctTermsPerResource

      private boolean countDistinctTermsPerResource
      If true, each token is counted only once per resource (even if it appears multiple times in one literal or in several different literals).
    • topNStopwords

      private int topNStopwords
      The number N of most frequent tokens to extract as stopwords.
    • stopwordsPercentage

      private double stopwordsPercentage
      The fraction of resources in which a token must occur to count as a stopword. The value must be between zero and one.
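
      The effect of countDistinctTermsPerResource can be illustrated with a minimal sketch in plain Java. It is not the class's actual implementation: resources are modelled as plain lists of literal strings instead of Jena resources, and the names TermCountSketch, TOKENIZER and countTokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

public class TermCountSketch {

    // Hypothetical tokenizer: lowercases the text and splits on non-alphanumeric characters.
    public static final Function<String, Collection<String>> TOKENIZER = text -> {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    };

    // Each inner list holds the literals of one resource (an assumption made for this sketch).
    public static Map<String, Integer> countTokens(List<List<String>> literalsPerResource,
            Function<String, Collection<String>> tokenizer, boolean countDistinctTermsPerResource) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> literals : literalsPerResource) {
            // A set collapses repeated tokens within one resource; a list keeps every occurrence.
            Collection<String> tokens = countDistinctTermsPerResource
                    ? new HashSet<>() : new ArrayList<>();
            for (String literal : literals) {
                tokens.addAll(tokenizer.apply(literal));
            }
            for (String token : tokens) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> resources = Arrays.asList(
                Arrays.asList("The Hotel", "the best hotel"),
                Arrays.asList("The Conference"));
        // With distinct counting "the" is counted 2 times (once per resource), otherwise 3 times.
        System.out.println(countTokens(resources, TOKENIZER, true).get("the"));
        System.out.println(countTokens(resources, TOKENIZER, false).get("the"));
    }
}
```

      With distinct counting, a token's count equals the number of resources it occurs in, which is the quantity a resource-based percentage threshold needs.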
  • Constructor Details

    • StopwordExtraction

      public StopwordExtraction​(Function<String,​Collection<String>> tokenizer, boolean countDistinctTermsPerResource, int topNStopwords, double stopwordsPercentage, List<TextExtractor> valueExtractors)
      Extracts the stopwords based on two criteria: 1) the most frequently occurring tokens and 2) a percentage of resources. Extraction stops as soon as one of the two criteria is fulfilled.
      Parameters:
      tokenizer - function that splits a text into tokens
      countDistinctTermsPerResource - If true, each token is counted only once per resource (even if it appears multiple times in one literal or in several different literals).
      topNStopwords - how many stopwords to extract
      stopwordsPercentage - the fraction of resources (between 0 and 1) in which a token must occur to count as a stopword.
      valueExtractors - Literal extractors that select which literals/properties are used.
    • StopwordExtraction

      public StopwordExtraction​(Function<String,​Collection<String>> tokenizer, boolean countDistinctTermsPerResource, int topNStopwords, double stopwordsPercentage, TextExtractor... valueExtractors)
      Extracts the stopwords based on two criteria: 1) the most frequently occurring tokens and 2) a percentage of resources. Extraction stops as soon as one of the two criteria is fulfilled.
      Parameters:
      tokenizer - function that splits a text into tokens
      countDistinctTermsPerResource - If true, each token is counted only once per resource (even if it appears multiple times in one literal or in several different literals).
      topNStopwords - how many stopwords to extract
      stopwordsPercentage - the fraction of resources (between 0 and 1) in which a token must occur to count as a stopword.
      valueExtractors - Literal extractors that select which literals/properties are used.
    • StopwordExtraction

      public StopwordExtraction​(Function<String,​Collection<String>> tokenizer, int topNStopwords, org.apache.jena.rdf.model.Property... properties)
      Extracts the stopwords based on the most frequently occurring tokens.
      Parameters:
      tokenizer - function that splits a text into tokens
      topNStopwords - how many stopwords to extract
      properties - the properties which should be used for extracting the literals (text).
    • StopwordExtraction

      public StopwordExtraction​(Function<String,​Collection<String>> tokenizer, double stopwordsPercentage, org.apache.jena.rdf.model.Property... properties)
      Extracts the stopwords based on a percentage, given as a fraction between 0 and 1. For example, with 0.03 a token is a stopword if it occurs in more than 3 percent of all resources.
      Parameters:
      tokenizer - function that splits a text into tokens
      stopwordsPercentage - the fraction of resources (between 0 and 1) in which a token must occur to count as a stopword.
      properties - the properties which should be used for extracting the literals (text).
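
      How the two criteria described for the constructors might interact can be sketched in plain Java. This is one possible reading, not the class's confirmed behaviour: tokens are visited from most to least frequent, and selection stops once topNStopwords tokens are collected or once a token no longer occurs in more than stopwordsPercentage of all resources. The names StopwordSelectionSketch and selectStopwords, and the alphabetical tie-breaking, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StopwordSelectionSketch {

    // resourceCounts maps each token to the number of resources it occurs in (assumed input).
    public static Set<String> selectStopwords(Map<String, Integer> resourceCounts,
            int totalResources, int topNStopwords, double stopwordsPercentage) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(resourceCounts.entrySet());
        // Most frequent tokens first; ties are broken alphabetically to keep the result deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        Set<String> stopwords = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : entries) {
            // Criterion 1: enough stopwords have been collected.
            if (stopwords.size() >= topNStopwords) {
                break;
            }
            // Criterion 2: the token must occur in more than the given fraction of resources.
            double fraction = (double) entry.getValue() / totalResources;
            if (fraction <= stopwordsPercentage) {
                break;
            }
            stopwords.add(entry.getKey());
        }
        return stopwords;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 90);
        counts.put("of", 40);
        counts.put("hotel", 5);
        counts.put("venue", 2);
        System.out.println(selectStopwords(counts, 100, 2, 0.03));  // prints [the, of]
        System.out.println(selectStopwords(counts, 100, 10, 0.03)); // prints [the, of, hotel]
    }
}
```

      Under this reading, the single-criterion constructors correspond to disabling the other criterion, for example by passing a very large topNStopwords or a stopwordsPercentage of 0.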
  • Method Details

    • match

      public Alignment match​(org.apache.jena.ontology.OntModel source, org.apache.jena.ontology.OntModel target, Alignment inputAlignment, Properties properties) throws Exception
      Description copied from class: MatcherYAAAJena
      Aligns two ontologies specified via a Jena OntModel, with an input alignment as Alignment object, and returns the mapping of the resulting alignment. Note: This method might be called multiple times in a row when using the evaluation framework. Make sure to return a mapping which is specific to the given inputs.
      Specified by:
      match in interface IMatcher<org.apache.jena.ontology.OntModel,​Alignment,​Properties>
      Specified by:
      match in class MatcherYAAAJena
      Parameters:
      source - This OntModel represents the source ontology.
      target - This OntModel represents the target ontology.
      inputAlignment - This mapping represents the input alignment.
      properties - Additional properties.
      Returns:
      The resulting alignment of the matching process.
      Throws:
      Exception - Any exception which occurs during matching.
    • storeExtractedStopwords

      public void storeExtractedStopwords​(Iterable<? extends org.apache.jena.rdf.model.Resource> resources, String key)
    • storeExtractedStopwords

      public void storeExtractedStopwords​(Iterator<? extends org.apache.jena.rdf.model.Resource> resources, String key)
    • extractStopwords

      public Set<String> extractStopwords​(Iterable<? extends org.apache.jena.rdf.model.Resource> resources)
    • extractStopwords

      public Set<String> extractStopwords​(Iterator<? extends org.apache.jena.rdf.model.Resource> resources)