All Implemented Interfaces:
IMatcher<org.apache.jena.ontology.OntModel,Alignment,Properties>, eu.sealsproject.platform.res.domain.omt.IOntologyMatchingToolBridge, eu.sealsproject.platform.res.tool.api.IPlugin, eu.sealsproject.platform.res.tool.api.IToolBridge
Direct Known Subclasses:
LLMBase, SentenceTransformersMatcher, TransformersBaseFineTuner, TransformersFilter

public abstract class TransformersBase extends MatcherYAAAJena
This is a base class for all Transformers classes. It only contains some fields and their getters and setters.
  • Field Details

    • LOGGER

      private static final org.slf4j.Logger LOGGER
    • extractor

      protected TextExtractorMap extractor
    • modelName

      protected String modelName
    • trainingArguments

      protected TransformersArguments trainingArguments
    • usingTensorflow

      protected boolean usingTensorflow
    • cudaVisibleDevices

      protected String cudaVisibleDevices
    • transformersCache

      protected File transformersCache
    • multiProcessing

      protected TransformersMultiProcessing multiProcessing
    • multipleTextsToMultipleExamples

      protected boolean multipleTextsToMultipleExamples
    • LOREM_IPSUM

      private static final String LOREM_IPSUM
    • SPLIT_WORDS

      private static final Pattern SPLIT_WORDS
  • Constructor Details

  • Method Details

    • getExtractor

      public TextExtractor getExtractor()
      Returns the text extractor which extracts text from a given resource. This is the text which represents a resource.
      Returns:
      the text extractor
    • getExtractorMap

      public TextExtractorMap getExtractorMap()
      Returns the map variant of the text extractor which extracts text from a given resource. This is the text which represents a resource.
      Returns:
      the text extractor map
    • setExtractor

      public void setExtractor(TextExtractor extractor)
      Sets the extractor which computes the text from a given resource. This is the text which represents a resource.
      Parameters:
      extractor - the text extractor
    • setExtractorMap

      public void setExtractorMap(TextExtractorMap extractorMap)
      Sets the extractor which computes the text from a given resource. This is the map variant which also includes the given keys in the map when MultipleTextsToMultipleExamples is set to true.
      Parameters:
      extractorMap - the text extractor map
    • getModelName

      public String getModelName()
      Returns the model name which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer (see the first parameter pretrained_model_name_or_path of the from_pretrained function in the huggingface library).
      Returns:
      the model name as a string
    • setModelName

      public void setModelName(String modelName)
      Sets the model name which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer (see the first parameter pretrained_model_name_or_path of the from_pretrained function in the huggingface library). In case of a path, it should be absolute. The path can be generated by e.g. FileUtil.getCanonicalPathIfPossible(java.io.File).
      Parameters:
      modelName - the model name as a string
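      The requirement that a local model path be absolute can be met along the following lines. This is a minimal sketch: ModelPathSketch and absoluteModelPath are hypothetical names, and the fallback to the absolute path is only an assumption about what FileUtil.getCanonicalPathIfPossible(java.io.File) does, inferred from its name.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of the documented API.
final class ModelPathSketch {

    // Returns the canonical path of the model directory, or the plain
    // absolute path if canonicalization fails (assumed fallback behavior).
    static String absoluteModelPath(File modelDirectory) {
        try {
            return modelDirectory.getCanonicalPath();
        } catch (IOException e) {
            return modelDirectory.getAbsolutePath();
        }
    }

    public static void main(String[] args) {
        // "models/my-finetuned-model" is a made-up relative directory.
        File relative = new File("models/my-finetuned-model");
        System.out.println(absoluteModelPath(relative));
    }
}
```

      The resulting string is what would be passed to setModelName(String) when the model lives on disk rather than on huggingface.co.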
    • getTrainingArguments

      public TransformersArguments getTrainingArguments()
      Returns the training arguments of the huggingface trainer. Any of the training arguments which are listed on the documentation can be used.
      Returns:
      the training arguments
    • setTrainingArguments

      public void setTrainingArguments(TransformersArguments configuration)
      Sets the training arguments of the huggingface trainer. Any of the training arguments which are listed on the documentation can be used.
      Parameters:
      configuration - the trainer configuration
    • addTrainingArgument

      public void addTrainingArgument(String key, Object value)
      Adds a training argument for the transformers trainer. Any of the training arguments which are listed on the documentation can be used.
      Parameters:
      key - The key of the training argument like warmup_ratio
      value - the corresponding value like 0.2
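      The key/value style of the training arguments can be illustrated with a plain map. This is a sketch only: TrainingArgumentsSketch is a hypothetical stand-in for TransformersArguments, and how the real class forwards the values to the huggingface trainer is not described in this section. The key warmup_ratio and the value 0.2 are taken from the parameter description above.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for TransformersArguments: an ordered key/value store.
final class TrainingArgumentsSketch {

    private final Map<String, Object> arguments = new LinkedHashMap<>();

    // Adds or overwrites one training argument and returns this for chaining.
    TrainingArgumentsSketch add(String key, Object value) {
        arguments.put(key, value);
        return this;
    }

    // Read-only view of the collected arguments in insertion order.
    Map<String, Object> asMap() {
        return Collections.unmodifiableMap(arguments);
    }

    public static void main(String[] args) {
        TrainingArgumentsSketch sketch = new TrainingArgumentsSketch()
                .add("warmup_ratio", 0.2)
                .add("num_train_epochs", 3);
        System.out.println(sketch.asMap());
    }
}
```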
    • isUsingTensorflow

      public boolean isUsingTensorflow()
      Returns whether tensorflow is used to train the model. If true, the models are run with tensorflow. If false, pytorch is used.
      Returns:
      true, if tensorflow is used. false, if pytorch is used.
    • setUsingTensorflow

      public void setUsingTensorflow(boolean usingTensorflow)
      Sets whether tensorflow is used. If set to true, tensorflow is used. If set to false, pytorch is used.
      Parameters:
      usingTensorflow - true to use tensorflow and false to use pytorch.
    • getCudaVisibleDevicesButOnlyOneGPU

      protected String getCudaVisibleDevicesButOnlyOneGPU()
    • getCudaVisibleDevices

      public String getCudaVisibleDevices()
      Returns a string which is set to the environment variable CUDA_VISIBLE_DEVICES to select on which GPU the process should run. If null or empty, the default is used (all available GPUs).
      Returns:
      the variable CUDA_VISIBLE_DEVICES
    • setCudaVisibleDevices

      public void setCudaVisibleDevices(String cudaVisibleDevices)
      Sets the environment variable CUDA_VISIBLE_DEVICES to select on which GPUs the process should run. If null or the string is empty, the default is used (all available GPUs). If multiple GPUs can be used, then the values should be comma separated. Example: "0" to use only the first GPU. "1,3" to use the second and fourth GPU. The use of setCudaVisibleDevices(int...) is preferred because it is more type safe.
      Parameters:
      cudaVisibleDevices - the string which is set to the environment variable CUDA_VISIBLE_DEVICES
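      The null-or-empty rule described above might be applied to a child process roughly as follows. This is a sketch under assumptions: CudaEnvSketch and applyCudaVisibleDevices are hypothetical names, and the use of ProcessBuilder to start the external process is assumed rather than taken from this page.

```java
// Hypothetical helper showing how the string could reach a child process.
final class CudaEnvSketch {

    // Puts the value into the child's environment. A null or empty value
    // changes nothing, so the child keeps whatever it inherits (by default
    // all available GPUs are visible).
    static void applyCudaVisibleDevices(ProcessBuilder builder, String cudaVisibleDevices) {
        if (cudaVisibleDevices == null || cudaVisibleDevices.trim().isEmpty()) {
            return;
        }
        builder.environment().put("CUDA_VISIBLE_DEVICES", cudaVisibleDevices.trim());
    }

    public static void main(String[] args) {
        // The process is only configured here, never started.
        ProcessBuilder builder = new ProcessBuilder("python");
        applyCudaVisibleDevices(builder, "1,3");
        System.out.println(builder.environment().get("CUDA_VISIBLE_DEVICES"));
    }
}
```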
    • setCudaVisibleDevices

      public void setCudaVisibleDevices(int... cudaVisibleDevices)
      Sets the environment variable CUDA_VISIBLE_DEVICES to select on which GPUs the process should run. If no values are provided, then all available GPUs are used. If multiple GPUs should be used, then provide the values one after the other. All indices are zero based. So call setCudaVisibleDevices(0,1) to use the first two GPUs.
      Parameters:
      cudaVisibleDevices - the integer numbers which refer to the GPUs which should be used.
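      The relation between the int... variant and the comma-separated string variant can be sketched as below. CudaDevicesSketch and toCudaVisibleDevices are hypothetical names; mapping an empty argument list to the empty string (which the string setter treats as "all available GPUs") is an assumption consistent with the two descriptions above.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical converter from zero-based GPU indices to the string form.
final class CudaDevicesSketch {

    // Joins the indices with commas; no indices yields the empty string.
    static String toCudaVisibleDevices(int... gpuIndices) {
        if (gpuIndices == null || gpuIndices.length == 0) {
            return "";
        }
        return Arrays.stream(gpuIndices)
                .mapToObj(i -> Integer.toString(i))
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(toCudaVisibleDevices(0, 1)); // first two GPUs
        System.out.println(toCudaVisibleDevices(1, 3)); // second and fourth GPU
    }
}
```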
    • getTransformersCache

      public File getTransformersCache()
      Returns the cache folder where the pretrained transformers models are stored. If set to null, the default location is used (which is usually ~/.cache/huggingface/transformers/).
      Returns:
      the transformers cache folder.
    • setTransformersCache

      public void setTransformersCache(File transformersCache)
      Sets the cache folder where the pretrained transformers models are stored. If set to null, the default location is used (which is usually ~/.cache/huggingface/transformers/). This setter is useful if the default location does not have enough space available. In that case, set it to a folder which has a lot of free space.
      Parameters:
      transformersCache - The transformers cache folder.
    • getMultiProcessing

      public TransformersMultiProcessing getMultiProcessing()
      Returns the multiprocessing value of the transformer. The transformers library may not free all memory from GPU. Thus the prediction and training are wrapped in an external process. This enum defines how the process is started and if multiprocessing should be used at all. Default is to use the system dependent default.
      Returns:
      the enum which represents the multi process starting method.
    • setMultiProcessing

      public void setMultiProcessing(TransformersMultiProcessing multiProcessing)
      Sets the multiprocessing value of the transformer. The transformers library may not free all memory from GPU. Thus the prediction and training are wrapped in an external process. This enum defines how the process is started and if multiprocessing should be used at all. Default is to use the system dependent default.
      Parameters:
      multiProcessing - the enum which represents the multi process starting method.
    • setOptimizeForMixedPrecisionTraining

      public void setOptimizeForMixedPrecisionTraining(boolean mpt)
      Enable or disable the mixed precision training. This will optimize the runtime of training and
      Parameters:
      mpt - true to enable mixed precision training
    • isOptimizeForMixedPrecisionTraining

      public boolean isOptimizeForMixedPrecisionTraining()
      Returns whether mixed precision training is enabled or disabled.
      Returns:
      true if mixed precision training is enabled.
    • isMultipleTextsToMultipleExamples

      public boolean isMultipleTextsToMultipleExamples()
      Returns whether all texts returned by the text extractor are used separately to generate the examples. Otherwise all texts are concatenated to form one example (the default). This should only be enabled when the extractor does not return many texts because otherwise a lot of examples are produced.
      Returns:
      true, if generation of multiple examples is enabled
    • setMultipleTextsToMultipleExamples

      public void setMultipleTextsToMultipleExamples(boolean multipleTextsToMultipleExamples)
      If set to true, then all texts returned by the text extractor are used separately to generate the examples. Otherwise all texts are concatenated to form one example (the default). This should only be enabled when the extractor does not return many texts because otherwise a lot of examples are produced.
      Parameters:
      multipleTextsToMultipleExamples - true, to enable the generation of multiple examples.
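      The difference between the two modes, and why many extracted texts lead to many examples, can be sketched as follows. ExampleGenerationSketch and buildExamples are hypothetical names. Pairing every text of one resource with every text of the other, and joining texts with a single space, are assumptions made for illustration and are not confirmed by this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the two example-generation modes.
final class ExampleGenerationSketch {

    // Builds (left, right) text pairs for one candidate correspondence.
    static List<String[]> buildExamples(List<String> leftTexts, List<String> rightTexts,
                                        boolean multipleTextsToMultipleExamples) {
        List<String[]> examples = new ArrayList<>();
        if (multipleTextsToMultipleExamples) {
            // Assumed: every left text is paired with every right text.
            for (String left : leftTexts) {
                for (String right : rightTexts) {
                    examples.add(new String[] {left, right});
                }
            }
        } else {
            // Default: all texts of a resource are concatenated into one text.
            examples.add(new String[] {
                String.join(" ", leftTexts), String.join(" ", rightTexts)
            });
        }
        return examples;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("heart", "cardiac muscle");
        List<String> right = Arrays.asList("Heart", "the heart", "cor");
        System.out.println(buildExamples(left, right, true).size());
        System.out.println(buildExamples(left, right, false).size());
    }
}
```

      Under these assumptions the number of examples grows with the product of the text counts, which is why the option is best reserved for extractors that return only a few texts.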
    • getTextualRepresentation

      protected Map<String,Set<String>> getTextualRepresentation(org.apache.jena.rdf.model.Resource r, Map<org.apache.jena.rdf.model.Resource,Map<String,Set<String>>> cache)
    • getExamplesForBatchSizeOptimization

      protected List<String> getExamplesForBatchSizeOptimization(File trainingFile, int numberOfExamples, BatchSizeOptimization optimization)
    • getExamplesForBatchSizeOptimizationGivenComparator

      private static List<String> getExamplesForBatchSizeOptimizationGivenComparator(File trainingFile, int numberOfExamples, Comparator<List<String>> comparer)
      Creates examples for the batch size optimization and takes care of the csv format (in case one entity is distributed over multiple lines).
      Parameters:
      trainingFile - the training file to read from
      numberOfExamples - number of examples to be returned
      comparer - the comparator (should fulfill the comparer interface: -1 if the first is smaller than the second, etc.)
      Returns:
      the largest elements in this file as a list of strings (these are already csv formatted).
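      Selecting the largest elements under a given comparator can be sketched with a bounded heap. This is not the actual implementation: LargestExamplesSketch, largest and BY_TOTAL_LENGTH are hypothetical names, the sketch works on records that are already parsed instead of reading a csv file, and ordering records by their total character length is only an assumed example of a comparer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical top-n selection over already parsed csv records.
final class LargestExamplesSketch {

    // Assumed example comparer: orders records by total number of characters.
    static final Comparator<List<String>> BY_TOTAL_LENGTH =
            Comparator.comparingInt((List<String> r) -> r.stream().mapToInt(String::length).sum());

    // Returns the numberOfExamples largest records, largest first.
    static List<List<String>> largest(List<List<String>> records, int numberOfExamples,
                                      Comparator<List<String>> comparer) {
        if (numberOfExamples <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the smallest record kept so far sits on top and is
        // dropped as soon as the heap grows beyond the requested size.
        PriorityQueue<List<String>> heap = new PriorityQueue<>(comparer);
        for (List<String> record : records) {
            heap.add(record);
            if (heap.size() > numberOfExamples) {
                heap.poll();
            }
        }
        List<List<String>> result = new ArrayList<>(heap);
        result.sort(comparer.reversed());
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> records = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("hello", "world"),
                Arrays.asList("xyz", "12"));
        System.out.println(largest(records, 2, BY_TOTAL_LENGTH));
    }
}
```

      Keeping only numberOfExamples records in memory means the whole file does not have to be sorted, which matters when the training file is large.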
    • createLoremIpsum

      private static List<String> createLoremIpsum(int numberOfExamples)
    • writeExamplesToFile

      protected boolean writeExamplesToFile(List<String> list, File destination, int numberOfExamples) throws IOException
      Throws:
      IOException