Class StopwordExtraction
java.lang.Object
eu.sealsproject.platform.res.tool.impl.AbstractPlugin
de.uni_mannheim.informatik.dws.melt.matching_base.MatcherURL
de.uni_mannheim.informatik.dws.melt.matching_base.MatcherFile
de.uni_mannheim.informatik.dws.melt.matching_jena.MatcherYAAA
de.uni_mannheim.informatik.dws.melt.matching_jena.MatcherYAAAJena
de.uni_mannheim.informatik.dws.melt.matching_jena_matchers.elementlevel.StopwordExtraction
- All Implemented Interfaces:
IMatcher<org.apache.jena.ontology.OntModel, Alignment, Properties>,
eu.sealsproject.platform.res.domain.omt.IOntologyMatchingToolBridge,
eu.sealsproject.platform.res.tool.api.IPlugin,
eu.sealsproject.platform.res.tool.api.IToolBridge
Extracts corpus-dependent stopwords from instances, classes, and properties.
-
Field Summary
Modifier and Type, Field, Description
private boolean countDistinctTermsPerResource
If true, each token is counted only once per resource (even if it appears several times in one literal or in several literals).
private static final org.slf4j.Logger LOGGER
private double stopwordsPercentage
The fraction of resources in which a token must occur to count as a stopword.
private Function<String, Collection<String>> tokenizer
Tokenizer function.
private int topNStopwords
Extracts the N most frequent tokens as stopwords.
private List<TextExtractor> valueExtractors
Literal extractors that select which literals/properties are used.
Fields inherited from class de.uni_mannheim.informatik.dws.melt.matching_base.MatcherFile
FILE_PREFIX, FILE_SUFFIX
-
Constructor Summary
Constructor, Description
StopwordExtraction(Function<String, Collection<String>> tokenizer, boolean countDistinctTermsPerResource, int topNStopwords, double stopwordsPercentage, TextExtractor... valueExtractors)
Extracts the stopwords based on two criteria.
StopwordExtraction(Function<String, Collection<String>> tokenizer, boolean countDistinctTermsPerResource, int topNStopwords, double stopwordsPercentage, List<TextExtractor> valueExtractors)
Extracts the stopwords based on two criteria.
StopwordExtraction(Function<String, Collection<String>> tokenizer, double stopwordsPercentage, org.apache.jena.rdf.model.Property... properties)
Extracts the stopwords based on the percentage (should be between 0 and 1).
StopwordExtraction(Function<String, Collection<String>> tokenizer, int topNStopwords, org.apache.jena.rdf.model.Property... properties)
Extracts the stopwords based on the most frequently occurring tokens.
-
Method Summary
Modifier and Type, Method, Description
extractStopwords(Iterable<? extends org.apache.jena.rdf.model.Resource> resources)
extractStopwords(Iterator<? extends org.apache.jena.rdf.model.Resource> resources)
Alignment match(org.apache.jena.ontology.OntModel source, org.apache.jena.ontology.OntModel target, Alignment inputAlignment, Properties properties)
Aligns two ontologies specified via a Jena OntModel, with an input alignment as Alignment object, and returns the mapping of the resulting alignment.
void storeExtractedStopwords(Iterable<? extends org.apache.jena.rdf.model.Resource> resources, String key)
void storeExtractedStopwords(Iterator<? extends org.apache.jena.rdf.model.Resource> resources, String key)
Methods inherited from class de.uni_mannheim.informatik.dws.melt.matching_jena.MatcherYAAAJena
getModelSpec, match, readOntology
Methods inherited from class de.uni_mannheim.informatik.dws.melt.matching_jena.MatcherYAAA
match
Methods inherited from class de.uni_mannheim.informatik.dws.melt.matching_base.MatcherFile
match
Methods inherited from class de.uni_mannheim.informatik.dws.melt.matching_base.MatcherURL
align, align, canExecute, getType
Methods inherited from class eu.sealsproject.platform.res.tool.impl.AbstractPlugin
getId, getVersion, setId, setVersion
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface eu.sealsproject.platform.res.tool.api.IPlugin
getId, getVersion
-
Field Details
-
LOGGER
private static final org.slf4j.Logger LOGGER
-
valueExtractors
Literal extractors to choose which literal/properties should be used.
-
tokenizer
Tokenizer function.
-
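The tokenizer can be any Function<String, Collection<String>>. As a minimal sketch of what such a function might look like, assuming that lowercasing and splitting on non-alphanumeric characters is acceptable for the corpus (the class and constant names below are hypothetical and not part of MELT):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical helper, not part of MELT: a simple tokenizer that could be
// passed wherever a Function<String, Collection<String>> is expected.
class SimpleTokenizerSketch {

    // Lowercases the text and splits it on runs of characters that are
    // neither letters nor digits. Empty tokens are dropped.
    static final Function<String, Collection<String>> TOKENIZER = text -> {
        List<String> tokens = new ArrayList<>();
        if (text == null) {
            return tokens;
        }
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    };

    public static void main(String[] args) {
        // prints [the, conference, paper, of, the, year]
        System.out.println(TOKENIZER.apply("The Conference-Paper of the Year"));
    }
}
```

Duplicates are deliberately kept in the returned collection, so that the counting step can decide whether repeated tokens matter.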
countDistinctTermsPerResource
private boolean countDistinctTermsPerResource
If true, each token is counted only once per resource (even if it appears several times in one literal or in several literals).
-
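The effect of countDistinctTermsPerResource can be illustrated with a small counting sketch. This is not MELT's code; it assumes the literals of each resource have already been tokenized, and the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not MELT code: each inner list holds the tokens
// of all literals belonging to one resource.
class DistinctCountSketch {

    static Map<String, Integer> countTokens(List<List<String>> tokensPerResource,
                                            boolean countDistinctTermsPerResource) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> tokens : tokensPerResource) {
            // With the flag set, duplicates within one resource collapse
            // into a single occurrence before counting.
            Collection<String> counted = countDistinctTermsPerResource
                    ? new HashSet<String>(tokens)
                    : new ArrayList<String>(tokens);
            for (String token : counted) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> resources = Arrays.asList(
                Arrays.asList("has", "author", "has", "name"),
                Arrays.asList("has", "title"));
        System.out.println(countTokens(resources, true).get("has"));  // prints 2
        System.out.println(countTokens(resources, false).get("has")); // prints 3
    }
}
```

With the flag enabled, a count equals the number of resources that contain the token, which is the quantity a per-resource percentage threshold needs.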
topNStopwords
private int topNStopwords
Extracts the N most frequent tokens as stopwords.
-
stopwordsPercentage
private double stopwordsPercentage
The fraction of resources in which a token must occur to count as a stopword. Ranges between zero and one.
-
-
Constructor Details
-
StopwordExtraction
public StopwordExtraction(Function<String, Collection<String>> tokenizer, boolean countDistinctTermsPerResource, int topNStopwords, double stopwordsPercentage, List<TextExtractor> valueExtractors)
Extracts the stopwords based on two criteria: 1) the most frequently occurring tokens and 2) the percentage of resources in which a token occurs. Extraction stops as soon as one of the two criteria is fulfilled.
- Parameters:
tokenizer - the tokenizer function
countDistinctTermsPerResource - if true, each token is counted only once per resource (even if it appears several times in one literal or in several literals)
topNStopwords - how many stopwords to extract
stopwordsPercentage - the fraction of resources in which a token must occur to count as a stopword (between 0 and 1)
valueExtractors - literal extractors that select which literals/properties are used
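One plausible reading of how the two criteria interact can be sketched as follows. This is an assumption, not MELT's actual implementation: tokens are visited from most to least frequent, and the walk stops once topNStopwords tokens are collected or once a token no longer occurs in more than stopwordsPercentage of the resources. The class name, method name, and the alphabetical tie-breaking are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not MELT's implementation: combines the top-N and the
// percentage criterion by walking tokens in descending order of frequency.
class CombinedCriteriaSketch {

    // resourceCounts maps each token to the number of resources containing it.
    static List<String> extract(Map<String, Integer> resourceCounts, int numberOfResources,
                                int topNStopwords, double stopwordsPercentage) {
        List<String> stopwords = new ArrayList<>();
        if (numberOfResources <= 0) {
            return stopwords;
        }
        List<String> tokens = new ArrayList<>(resourceCounts.keySet());
        tokens.sort((a, b) -> {
            int byCount = Integer.compare(resourceCounts.get(b), resourceCounts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b); // ties: alphabetical (assumption)
        });
        for (String token : tokens) {
            // Criterion 1: enough stopwords have been collected.
            if (stopwords.size() >= topNStopwords) {
                break;
            }
            // Criterion 2: this token, and every rarer token after it,
            // does not occur in more than the given fraction of resources.
            double fraction = (double) resourceCounts.get(token) / numberOfResources;
            if (fraction <= stopwordsPercentage) {
                break;
            }
            stopwords.add(token);
        }
        return stopwords;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("has", 60);
        counts.put("of", 30);
        counts.put("name", 10);
        counts.put("wine", 2);
        System.out.println(extract(counts, 100, 2, 0.05)); // prints [has, of]  (stopped by top-N)
        System.out.println(extract(counts, 100, 10, 0.5)); // prints [has]      (stopped by percentage)
    }
}
```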
-
StopwordExtraction
public StopwordExtraction(Function<String, Collection<String>> tokenizer, boolean countDistinctTermsPerResource, int topNStopwords, double stopwordsPercentage, TextExtractor... valueExtractors)
Extracts the stopwords based on two criteria: 1) the most frequently occurring tokens and 2) the percentage of resources in which a token occurs. Extraction stops as soon as one of the two criteria is fulfilled.
- Parameters:
tokenizer - the tokenizer function
countDistinctTermsPerResource - if true, each token is counted only once per resource (even if it appears several times in one literal or in several literals)
topNStopwords - how many stopwords to extract
stopwordsPercentage - the fraction of resources in which a token must occur to count as a stopword (between 0 and 1)
valueExtractors - literal extractors that select which literals/properties are used
-
StopwordExtraction
public StopwordExtraction(Function<String, Collection<String>> tokenizer, int topNStopwords, org.apache.jena.rdf.model.Property... properties)
Extracts the stopwords based on the most frequently occurring tokens.
- Parameters:
tokenizer - the tokenizer function
topNStopwords - how many stopwords to extract
properties - the properties which should be used for extracting the literals (text)
-
StopwordExtraction
public StopwordExtraction(Function<String, Collection<String>> tokenizer, double stopwordsPercentage, org.apache.jena.rdf.model.Property... properties)
Extracts the stopwords based on the percentage (should be between 0 and 1). For example, with a value of 0.03, a token is a stopword if it occurs in more than 3 percent of all resources.
- Parameters:
tokenizer - the tokenizer function
stopwordsPercentage - the fraction of resources in which a token must occur to count as a stopword (between 0 and 1)
properties - the properties which should be used for extracting the literals (text)
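The 3 percent example can be worked through with a small sketch. It is not MELT's code: the class and method names are hypothetical, the counts are invented for illustration, and the strict "more than" comparison follows the wording of the description above rather than a confirmed implementation detail.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration, not MELT code: selects every token that occurs
// in more than the given fraction of all resources.
class PercentageThresholdSketch {

    // resourceCounts maps each token to the number of resources containing it.
    static Set<String> aboveFraction(Map<String, Integer> resourceCounts,
                                     int numberOfResources, double stopwordsPercentage) {
        Set<String> stopwords = new TreeSet<>();
        if (numberOfResources <= 0) {
            return stopwords;
        }
        for (Map.Entry<String, Integer> entry : resourceCounts.entrySet()) {
            double fraction = (double) entry.getValue() / numberOfResources;
            if (fraction > stopwordsPercentage) {
                stopwords.add(entry.getKey());
            }
        }
        return stopwords;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("has", 40);  // 40 / 200 = 0.20  -> stopword
        counts.put("of", 10);   // 10 / 200 = 0.05  -> stopword
        counts.put("wine", 3);  //  3 / 200 = 0.015 -> kept
        System.out.println(aboveFraction(counts, 200, 0.03)); // prints [has, of]
    }
}
```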
-
-
Method Details
-
match
public Alignment match(org.apache.jena.ontology.OntModel source, org.apache.jena.ontology.OntModel target, Alignment inputAlignment, Properties properties) throws Exception
Description copied from class: MatcherYAAAJena
Aligns two ontologies specified via a Jena OntModel, with an input alignment as Alignment object, and returns the mapping of the resulting alignment. Note: This method might be called multiple times in a row when using the evaluation framework. Make sure to return a mapping which is specific to the given inputs.
- Specified by:
match in interface IMatcher<org.apache.jena.ontology.OntModel, Alignment, Properties>
- Specified by:
match in class MatcherYAAAJena
- Parameters:
source - This OntModel represents the source ontology.
target - This OntModel represents the target ontology.
inputAlignment - This mapping represents the input alignment.
properties - Additional properties.
- Returns:
The resulting alignment of the matching process.
- Throws:
Exception - Any exception which occurs during matching.
-
storeExtractedStopwords
-
storeExtractedStopwords
-
extractStopwords
-
extractStopwords
-