java.lang.Object

eu.sealsproject.platform.res.tool.impl.AbstractPlugin

de.uni_mannheim.informatik.dws.melt.matching_ml.python.nlptransformers.TransformersBase

All Implemented Interfaces: IMatcher<org.apache.jena.ontology.OntModel,Alignment,Properties>, eu.sealsproject.platform.res.domain.omt.IOntologyMatchingToolBridge, eu.sealsproject.platform.res.tool.api.IPlugin, eu.sealsproject.platform.res.tool.api.IToolBridge

Direct Known Subclasses: LLMBase, SentenceTransformersMatcher, TransformersBaseFineTuner, TransformersFilter

public abstract class TransformersBase extends MatcherYAAAJena

This is the base class for all transformer-based components. It only contains shared variables and their getters and setters.

Field Summary

Fields

Modifier and Type

Field

Description

protected String

cudaVisibleDevices

protected TextExtractorMap

extractor

private static final org.slf4j.Logger

LOGGER

private static final String

LOREM_IPSUM

protected String

modelName

protected boolean

multipleTextsToMultipleExamples

protected TransformersMultiProcessing

multiProcessing

private static final Pattern

SPLIT_WORDS

protected TransformersArguments

trainingArguments

protected File

transformersCache

protected boolean

usingTensorflow

Fields inherited from class de.uni_mannheim.informatik.dws.melt.matching_base.MatcherFile
FILE_PREFIX, FILE_SUFFIX
Constructor Summary

Constructors

Constructor

Description

TransformersBase(TextExtractorMap extractor, String modelName)

Constructor with all required parameters.

TransformersBase(TextExtractor extractor, String modelName)

Constructor with all required parameters.
Method Summary

Modifier and Type

Method

Description

void

addTrainingArgument(String key, Object value)

Adds a training argument for the transformers trainer.

private static List<String>

createLoremIpsum(int numberOfExamples)

String

getCudaVisibleDevices()

Returns a string which is set to the environment variable CUDA_VISIBLE_DEVICES to select on which GPU the process should run.

protected String

getCudaVisibleDevicesButOnlyOneGPU()

protected List<String>

getExamplesForBatchSizeOptimization(File trainingFile, int numberOfExamples, BatchSizeOptimization optimization)

private static List<String>

getExamplesForBatchSizeOptimizationGivenComparator(File trainingFile, int numberOfExamples, Comparator<List<String>> comparer)

Creates examples for the batch size optimization and takes care of the CSV format (in case one entity is distributed over multiple lines).

TextExtractor

getExtractor()

Returns the text extractor which extracts text from a given resource.

TextExtractorMap

getExtractorMap()

Returns the text extractor which extracts text from a given resource.

String

getModelName()

Returns the model name, which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer (see the first parameter pretrained_model_name_or_path of the from_pretrained function in the huggingface library).

TransformersMultiProcessing

getMultiProcessing()

Returns the multiprocessing value of the transformer.

protected Map<String,Set<String>>

getTextualRepresentation(org.apache.jena.rdf.model.Resource r, Map<org.apache.jena.rdf.model.Resource,Map<String,Set<String>>> cache)

TransformersArguments

getTrainingArguments()

Returns the training arguments of the huggingface trainer.

File

getTransformersCache()

Returns the cache folder where the pretrained transformers models are stored.

boolean

isMultipleTextsToMultipleExamples()

Returns whether all texts returned by the text extractor are used separately to generate the examples.

boolean

isOptimizeForMixedPrecisionTraining()

Returns whether mixed precision training is enabled or disabled.

boolean

isUsingTensorflow()

Returns a boolean value if tensorflow is used to train the model.

void

setCudaVisibleDevices(int... cudaVisibleDevices)

Sets the environment variable CUDA_VISIBLE_DEVICES to select on which GPUs the process should run.

void

setCudaVisibleDevices(String cudaVisibleDevices)

Sets the environment variable CUDA_VISIBLE_DEVICES to select on which GPUs the process should run.

void

setExtractor(TextExtractor extractor)

Sets the extractor which computes the text from a given resource.

void

setExtractorMap(TextExtractorMap extractorMap)

Sets the extractor which computes the text from a given resource.

void

setModelName(String modelName)

Sets the model name, which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer (see the first parameter pretrained_model_name_or_path of the from_pretrained function in the huggingface library).

void

setMultipleTextsToMultipleExamples(boolean multipleTextsToMultipleExamples)

If set to true, then all texts returned by the text extractor are used separately to generate the examples.

void

setMultiProcessing(TransformersMultiProcessing multiProcessing)

Sets the multiprocessing value of the transformer.

void

setOptimizeForMixedPrecisionTraining(boolean mpt)

Enable or disable the mixed precision training.

void

setTrainingArguments(TransformersArguments configuration)

Sets the training arguments of the huggingface trainer.

void

setTransformersCache(File transformersCache)

Sets the cache folder where the pretrained transformers models are stored.

void

setUsingTensorflow(boolean usingTensorflow)

Sets the boolean value if tensorflow is used.

protected boolean

writeExamplesToFile(List<String> list, File destination, int numberOfExamples)

Methods inherited from class de.uni_mannheim.informatik.dws.melt.matching_jena.MatcherYAAAJena
getModelSpec, match, match, readOntology

Methods inherited from class de.uni_mannheim.informatik.dws.melt.matching_jena.MatcherYAAA
match

Methods inherited from class de.uni_mannheim.informatik.dws.melt.matching_base.MatcherFile
match

Methods inherited from class de.uni_mannheim.informatik.dws.melt.matching_base.MatcherURL
align, align, canExecute, getType

Methods inherited from class eu.sealsproject.platform.res.tool.impl.AbstractPlugin
getId, getVersion, setId, setVersion

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface eu.sealsproject.platform.res.tool.api.IPlugin
getId, getVersion

Field Details
- LOGGER
  
  private static final org.slf4j.Logger LOGGER
- extractor
  
  protected TextExtractorMap extractor
- modelName
  
  protected String modelName
- trainingArguments
  
  protected TransformersArguments trainingArguments
- usingTensorflow
  
  protected boolean usingTensorflow
- cudaVisibleDevices
  
  protected String cudaVisibleDevices
- transformersCache
  
  protected File transformersCache
- multiProcessing
  
  protected TransformersMultiProcessing multiProcessing
- multipleTextsToMultipleExamples
  
  protected boolean multipleTextsToMultipleExamples
- LOREM_IPSUM
  
  private static final String LOREM_IPSUM
- SPLIT_WORDS
  
  private static final Pattern SPLIT_WORDS
Constructor Details
- TransformersBase
  
  public TransformersBase(TextExtractorMap extractor, String modelName)
  
  Constructor with all required parameters.
  
  Parameters:
  
  extractor - the extractor to select which text for each resource should be used.
  
  modelName - the model name, which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer (see the first parameter pretrained_model_name_or_path of the from_pretrained function in the huggingface library). In case of a path, it should be absolute. The path can be generated by e.g. FileUtil.getCanonicalPathIfPossible(java.io.File)
- TransformersBase
  
  public TransformersBase(TextExtractor extractor, String modelName)
  
  Constructor with all required parameters.
  
  Parameters:
  
  extractor - the extractor to select which text for each resource should be used.
  
  modelName - the model name, which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer (see the first parameter pretrained_model_name_or_path of the from_pretrained function in the huggingface library). In case of a path, it should be absolute. The path can be generated by e.g. FileUtil.getCanonicalPathIfPossible(java.io.File)
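
The requirement that a local model path be absolute can be illustrated with a small, self-contained sketch. The helper `canonicalPathIfPossible` below is a hypothetical stand-in written only with `java.io.File`; it is assumed, not confirmed, to behave similarly to `FileUtil.getCanonicalPathIfPossible(java.io.File)`. The relative path `models/my-finetuned-model` is also made up for illustration.

```java
import java.io.File;
import java.io.IOException;

final class ModelPathSketch {

    // Hypothetical stand-in (assumption): prefer the canonical path and
    // fall back to the absolute path if the file system lookup fails.
    static String canonicalPathIfPossible(File file) {
        try {
            return file.getCanonicalPath();
        } catch (IOException e) {
            return file.getAbsolutePath();
        }
    }

    public static void main(String[] args) {
        // Hypothetical relative location of a locally stored model.
        File localModel = new File("models/my-finetuned-model");
        // The resulting absolute string is what would be passed as modelName.
        String modelName = canonicalPathIfPossible(localModel);
        System.out.println(modelName);
    }
}
```

The resulting string could then be passed as the `modelName` argument of either constructor of a concrete subclass.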
Method Details
- getExtractor
  
  public TextExtractor getExtractor()
  
  Returns the text extractor which extracts text from a given resource. This is the text which represents a resource.
  
  Returns:
  
  the text extractor
- getExtractorMap
  
  public TextExtractorMap getExtractorMap()
  
  Returns the text extractor which extracts text from a given resource. This is the text which represents a resource.
  
  Returns:
  
  the text extractor
- setExtractor
  
  public void setExtractor(TextExtractor extractor)
  
  Sets the extractor which computes the text from a given resource. This is the text which represents a resource.
  
  Parameters:
  
  extractor - the text extractor
- setExtractorMap
  
  public void setExtractorMap(TextExtractorMap extractorMap)
  
  Sets the extractor which computes the text from a given resource. This is the map variant which also includes the given keys in the map when MultipleTextsToMultipleExamples is set to true.
  
  Parameters:
  
  extractorMap - the text extractor map
- getModelName
  
  public String getModelName()
  
  Returns the model name, which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer (see the first parameter pretrained_model_name_or_path of the from_pretrained function in the huggingface library).
  
  Returns:
  
  the model name as a string
- setModelName
  
  public void setModelName(String modelName)
  
  Sets the model name, which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer (see the first parameter pretrained_model_name_or_path of the from_pretrained function in the huggingface library). In case of a path, it should be absolute. The path can be generated by e.g. FileUtil.getCanonicalPathIfPossible(java.io.File)
  
  Parameters:
  
  modelName - the model name as a string
- getTrainingArguments
  
  public TransformersArguments getTrainingArguments()
  
  Returns the training arguments of the huggingface trainer. Any of the training arguments which are listed on the documentation can be used.
  
  Returns:
  
  the training arguments
- setTrainingArguments
  
  public void setTrainingArguments(TransformersArguments configuration)
  
  Sets the training arguments of the huggingface trainer. Any of the training arguments which are listed on the documentation can be used.
  
  Parameters:
  
  configuration - the trainer configuration
- addTrainingArgument
  
  public void addTrainingArgument(String key, Object value)
  
  Adds a training argument for the transformers trainer. Any of the training arguments which are listed on the documentation can be used.
  
  Parameters:
  
  key - The key of the training argument like warmup_ratio
  
  value - the corresponding value like 0.2
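
  The key/value shape of training arguments can be sketched without the library. The class `TrainingArgumentsSketch` below is a hypothetical stand-in backed by a plain map; it is not the real `TransformersArguments` class, and the assumption that a repeated key overwrites the earlier value is ours. The keys `warmup_ratio` and `num_train_epochs` are names of huggingface training arguments.

  ```java
  import java.util.LinkedHashMap;
  import java.util.Map;

  final class TrainingArgumentsSketch {

      // Insertion-ordered storage of argument name to value.
      private final Map<String, Object> arguments = new LinkedHashMap<>();

      // Mirrors the shape of addTrainingArgument(String key, Object value).
      // Assumption: adding the same key twice replaces the earlier value.
      void addTrainingArgument(String key, Object value) {
          arguments.put(key, value);
      }

      Map<String, Object> asMap() {
          return arguments;
      }

      public static void main(String[] args) {
          TrainingArgumentsSketch sketch = new TrainingArgumentsSketch();
          sketch.addTrainingArgument("warmup_ratio", 0.2);
          sketch.addTrainingArgument("num_train_epochs", 3);
          System.out.println(sketch.asMap());
      }
  }
  ```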
- isUsingTensorflow
  
  public boolean isUsingTensorflow()
  
  Returns a boolean value if tensorflow is used to train the model. If true, the models are run with tensorflow. If false, pytorch is used.
  
  Returns:
  
  true, if tensorflow is used. false, if pytorch is used.
- setUsingTensorflow
  
  public void setUsingTensorflow(boolean usingTensorflow)
  
  Sets the boolean value if tensorflow is used. If set to true, tensorflow is used; if set to false, pytorch is used.
  
  Parameters:
  
  usingTensorflow - true to use tensorflow and false to use pytorch.
- getCudaVisibleDevicesButOnlyOneGPU
  
  protected String getCudaVisibleDevicesButOnlyOneGPU()
- getCudaVisibleDevices
  
  public String getCudaVisibleDevices()
  
  Returns a string which is set to the environment variable CUDA_VISIBLE_DEVICES to select on which GPU the process should run. If null or empty, the default is used (all available GPUs).
  
  Returns:
  
  the variable CUDA_VISIBLE_DEVICES
- setCudaVisibleDevices
  
  public void setCudaVisibleDevices(String cudaVisibleDevices)
  
  Sets the environment variable CUDA_VISIBLE_DEVICES to select on which GPUs the process should run. If null or the string is empty, the default is used (all available GPUs). If multiple GPUs can be used, then the values should be comma separated. Example: "0" to use only the first GPU. "1,3" to use the second and fourth GPU. The use of setCudaVisibleDevices(int...) is preferred because it is more type safe.
  
  Parameters:
  
  cudaVisibleDevices - the string which is set to the environment variable CUDA_VISIBLE_DEVICES
- setCudaVisibleDevices
  
  public void setCudaVisibleDevices(int... cudaVisibleDevices)
  
  Sets the environment variable CUDA_VISIBLE_DEVICES to select on which GPUs the process should run. If no values are provided, then all available GPUs are used. If multiple GPUs should be used, then provide the values one after the other. All indices are zero based. So call setCudaVisibleDevices(0,1) to use the first two GPUs.
  
  Parameters:
  
  cudaVisibleDevices - the integer numbers which refer to the GPUs which should be used.
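
  The relation between the two setters can be sketched as follows. The helper `toEnvironmentValue` is hypothetical and only shows how zero-based integer indices could map to the comma-separated string form described above; the actual conversion inside the class may differ.

  ```java
  import java.util.Arrays;
  import java.util.stream.Collectors;

  final class CudaVisibleDevicesSketch {

      // Joins zero-based GPU indices into the comma-separated form
      // expected by the CUDA_VISIBLE_DEVICES environment variable.
      // With no indices the result is empty, i.e. the default (all GPUs).
      static String toEnvironmentValue(int... gpuIndices) {
          return Arrays.stream(gpuIndices)
                  .mapToObj(i -> Integer.toString(i))
                  .collect(Collectors.joining(","));
      }

      public static void main(String[] args) {
          System.out.println(toEnvironmentValue(0, 1)); // prints "0,1"
          System.out.println(toEnvironmentValue(1, 3)); // prints "1,3"
      }
  }
  ```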
- getTransformersCache
  
  public File getTransformersCache()
  
  Returns the cache folder where the pretrained transformers models are stored. If set to null, the default location is used (which is usually ~/.cache/huggingface/transformers/).
  
  Returns:
  
  the transformers cache folder.
- setTransformersCache
  
  public void setTransformersCache(File transformersCache)
  
  Sets the cache folder where the pretrained transformers models are stored. If set to null, the default location is used (which is usually ~/.cache/huggingface/transformers/). This setter is useful if the default location does not have enough space available. In that case, set it to a folder which has a lot of free space.
  
  Parameters:
  
  transformersCache - The transformers cache folder.
- getMultiProcessing
  
  public TransformersMultiProcessing getMultiProcessing()
  
  Returns the multiprocessing value of the transformer. The transformers library may not free all memory from GPU. Thus the prediction and training are wrapped in an external process. This enum defines how the process is started and if multiprocessing should be used at all. Default is to use the system dependent default.
  
  Returns:
  
  the enum which represents the multi process starting method.
- setMultiProcessing
  
  public void setMultiProcessing(TransformersMultiProcessing multiProcessing)
  
  Sets the multiprocessing value of the transformer. The transformers library may not free all memory from GPU. Thus the prediction and training are wrapped in an external process. This enum defines how the process is started and if multiprocessing should be used at all. Default is to use the system dependent default.
  
  Parameters:
  
  multiProcessing - the enum which represents the multi process starting method.
- setOptimizeForMixedPrecisionTraining
  
  public void setOptimizeForMixedPrecisionTraining(boolean mpt)
  
  Enable or disable the mixed precision training. This will optimize the runtime of training and
  
  Parameters:
  
  mpt - true to enable mixed precision training
- isOptimizeForMixedPrecisionTraining
  
  public boolean isOptimizeForMixedPrecisionTraining()
  
  Returns whether mixed precision training is enabled or disabled.
  
  Returns:
  
  true if mixed precision training is enabled.
- isMultipleTextsToMultipleExamples
  
  public boolean isMultipleTextsToMultipleExamples()
  
  Returns whether all texts returned by the text extractor are used separately to generate the examples. Otherwise all texts are concatenated to form one example (the default). This should only be enabled when the extractor does not return many texts, because otherwise a lot of examples are produced.
  
  Returns:
  
  true, if generation of multiple examples is enabled
- setMultipleTextsToMultipleExamples
  
  public void setMultipleTextsToMultipleExamples(boolean multipleTextsToMultipleExamples)
  
  If set to true, then all texts returned by the text extractor are used separately to generate the examples. Otherwise all texts are concatenated to form one example (the default). This should only be enabled when the extractor does not return many texts, because otherwise a lot of examples are produced.
  
  Parameters:
  
  multipleTextsToMultipleExamples - true, to enable the generation of multiple examples.
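
  Why this flag can produce many examples is easiest to see in a small sketch. The class below is hypothetical and independent of the library. It assumes that, when the flag is enabled, every text of the left resource is paired with every text of the right resource, and that the default joins the texts with a single space; both details are assumptions.

  ```java
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  final class ExampleGenerationSketch {

      // Builds (left, right) text pairs for one candidate correspondence.
      static List<String[]> buildExamples(List<String> leftTexts, List<String> rightTexts,
                                          boolean multipleTextsToMultipleExamples) {
          List<String[]> examples = new ArrayList<>();
          if (multipleTextsToMultipleExamples) {
              // Assumption: each left text is combined with each right text.
              for (String left : leftTexts) {
                  for (String right : rightTexts) {
                      examples.add(new String[] {left, right});
                  }
              }
          } else {
              // Default: all texts of one side are concatenated into one text.
              examples.add(new String[] {
                  String.join(" ", leftTexts), String.join(" ", rightTexts)});
          }
          return examples;
      }

      public static void main(String[] args) {
          List<String> left = Arrays.asList("car", "automobile");
          List<String> right = Arrays.asList("vehicle", "motor car", "auto");
          System.out.println(buildExamples(left, right, false).size()); // prints 1
          System.out.println(buildExamples(left, right, true).size());  // prints 6
      }
  }
  ```

  Under these assumptions the number of examples grows with the product of the text counts on both sides, which is why the flag suits extractors that return only a few texts.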
- getTextualRepresentation
  
  protected Map<String,Set<String>> getTextualRepresentation(org.apache.jena.rdf.model.Resource r, Map<org.apache.jena.rdf.model.Resource,Map<String,Set<String>>> cache)
- getExamplesForBatchSizeOptimization
  
  protected List<String> getExamplesForBatchSizeOptimization(File trainingFile, int numberOfExamples, BatchSizeOptimization optimization)
- getExamplesForBatchSizeOptimizationGivenComparator
  
  private static List<String> getExamplesForBatchSizeOptimizationGivenComparator(File trainingFile, int numberOfExamples, Comparator<List<String>> comparer)
  
  Creates examples for the batch size optimization and takes care of the CSV format (in case one entity is distributed over multiple lines).
  
  Parameters:
  
  trainingFile - the training file to read from
  
  numberOfExamples - number of examples to be returned
  
  comparer - the comparator (should fulfill the Comparator contract, e.g. return a negative value if the first argument is smaller than the second)
  
  Returns:
  
  the largest elements in this file as a list of strings (these are already csv formatted).
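
  Selecting the largest elements under a comparator can be sketched with a bounded min-heap. The code below is an illustration, not the method's actual implementation: it works on already parsed records (one `List<String>` per CSV record) and leaves out reading the file and formatting the result as CSV. The comparator `BY_TOTAL_LENGTH` is a hypothetical example of what "largest" could mean.

  ```java
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.Comparator;
  import java.util.List;
  import java.util.PriorityQueue;

  final class LargestExamplesSketch {

      // Hypothetical ordering: records with more characters in total are larger.
      static final Comparator<List<String>> BY_TOTAL_LENGTH =
              Comparator.comparingInt((List<String> r) -> r.stream().mapToInt(String::length).sum());

      // Keeps the n largest records; the heap root is always the smallest kept record.
      static List<List<String>> largest(Iterable<List<String>> records, int n,
                                        Comparator<List<String>> comparer) {
          List<List<String>> result = new ArrayList<>();
          if (n <= 0) {
              return result;
          }
          PriorityQueue<List<String>> heap = new PriorityQueue<>(n, comparer);
          for (List<String> record : records) {
              heap.add(record);
              if (heap.size() > n) {
                  heap.poll(); // drop the currently smallest record
              }
          }
          result.addAll(heap);
          result.sort(comparer.reversed()); // largest first
          return result;
      }

      public static void main(String[] args) {
          List<List<String>> records = Arrays.asList(
                  Arrays.asList("a", "b"),
                  Arrays.asList("long text", "x"),
                  Arrays.asList("medium", "yy"));
          System.out.println(largest(records, 2, BY_TOTAL_LENGTH));
      }
  }
  ```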
- createLoremIpsum
  
  private static List<String> createLoremIpsum(int numberOfExamples)
- writeExamplesToFile
  
  protected boolean writeExamplesToFile(List<String> list, File destination, int numberOfExamples) throws IOException
  
  Throws:
  
  IOException
