Class TransformersBase
java.lang.Object
eu.sealsproject.platform.res.tool.impl.AbstractPlugin
de.uni_mannheim.informatik.dws.melt.matching_base.MatcherURL
de.uni_mannheim.informatik.dws.melt.matching_base.MatcherFile
de.uni_mannheim.informatik.dws.melt.matching_jena.MatcherYAAA
de.uni_mannheim.informatik.dws.melt.matching_jena.MatcherYAAAJena
de.uni_mannheim.informatik.dws.melt.matching_ml.python.nlptransformers.TransformersBase
- All Implemented Interfaces:
IMatcher<org.apache.jena.ontology.OntModel, Alignment, Properties>, eu.sealsproject.platform.res.domain.omt.IOntologyMatchingToolBridge, eu.sealsproject.platform.res.tool.api.IPlugin, eu.sealsproject.platform.res.tool.api.IToolBridge
- Direct Known Subclasses:
LLMBase, SentenceTransformersMatcher, TransformersBaseFineTuner, TransformersFilter
This is a base class for all Transformers.
It just contains some variables together with their getters and setters.
-
Field Summary
Fields
Modifier and Type	Field
protected String	cudaVisibleDevices
protected TextExtractorMap	extractor
private static final org.slf4j.Logger	LOGGER
private static final String	LOREM_IPSUM
protected String	modelName
protected boolean	multipleTextsToMultipleExamples
protected TransformersMultiProcessing	multiProcessing
private static final Pattern	SPLIT_WORDS
protected TransformersArguments	trainingArguments
protected File	transformersCache
protected boolean	usingTensorflow
Fields inherited from class de.uni_mannheim.informatik.dws.melt.matching_base.MatcherFile
FILE_PREFIX, FILE_SUFFIX
-
Constructor Summary
Constructors
TransformersBase(TextExtractorMap extractor, String modelName)
Constructor with all required parameters.
TransformersBase(TextExtractor extractor, String modelName)
Constructor with all required parameters.
-
Method Summary
Modifier and Type	Method and Description
void	addTrainingArgument(String key, Object value)
Adds a training argument for the transformers trainer.
createLoremIpsum(int numberOfExamples)
String	getCudaVisibleDevices()
Returns a string which is set to the environment variable CUDA_VISIBLE_DEVICES to select on which GPUs the process should run.
protected String	getCudaVisibleDevicesButOnlyOneGPU()
protected List<String>	getExamplesForBatchSizeOptimization(File trainingFile, int numberOfExamples, BatchSizeOptimization optimization)
private static List<String>	getExamplesForBatchSizeOptimizationGivenComparator(File trainingFile, int numberOfExamples, Comparator<List<String>> comparer)
Creates examples for the batch size optimization which takes care of the CSV format (in case one entity is distributed over multiple lines).
TextExtractor	getExtractor()
Returns the text extractor which extracts text from a given resource.
TextExtractorMap	getExtractorMap()
Returns the text extractor map which extracts text from a given resource.
String	getModelName()
Returns the model name which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer (see the first parameter pretrained_model_name_or_path of the from_pretrained function in the huggingface library).
TransformersMultiProcessing	getMultiProcessing()
Returns the multiprocessing value of the transformer.
getTextualRepresentation(org.apache.jena.rdf.model.Resource r, Map<org.apache.jena.rdf.model.Resource, Map<String, Set<String>>> cache)
TransformersArguments	getTrainingArguments()
Returns the training arguments of the huggingface trainer.
File	getTransformersCache()
Returns the cache folder where the pretrained transformers models are stored.
boolean	isMultipleTextsToMultipleExamples()
Returns whether all texts returned by the text extractor are used separately to generate the examples.
boolean	isOptimizeForMixedPrecisionTraining()
Returns whether mixed precision training is enabled or disabled.
boolean	isUsingTensorflow()
Returns whether tensorflow is used to train the model.
void	setCudaVisibleDevices(int... cudaVisibleDevices)
Sets the environment variable CUDA_VISIBLE_DEVICES to select on which GPUs the process should run.
void	setCudaVisibleDevices(String cudaVisibleDevices)
Sets the environment variable CUDA_VISIBLE_DEVICES to select on which GPUs the process should run.
void	setExtractor(TextExtractor extractor)
Sets the extractor which computes the text from a given resource.
void	setExtractorMap(TextExtractorMap extractorMap)
Sets the extractor map which computes the text from a given resource.
void	setModelName(String modelName)
Sets the model name which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer (see the first parameter pretrained_model_name_or_path of the from_pretrained function in the huggingface library).
void	setMultipleTextsToMultipleExamples(boolean multipleTextsToMultipleExamples)
If set to true, all texts returned by the text extractor are used separately to generate the examples.
void	setMultiProcessing(TransformersMultiProcessing multiProcessing)
Sets the multiprocessing value of the transformer.
void	setOptimizeForMixedPrecisionTraining(boolean mpt)
Enables or disables mixed precision training.
void	setTrainingArguments(TransformersArguments configuration)
Sets the training arguments of the huggingface trainer.
void	setTransformersCache(File transformersCache)
Sets the cache folder where the pretrained transformers models are stored.
void	setUsingTensorflow(boolean usingTensorflow)
Sets whether tensorflow is used.
protected boolean	writeExamplesToFile(List<String> list, File destination, int numberOfExamples)
Methods inherited from class de.uni_mannheim.informatik.dws.melt.matching_jena.MatcherYAAAJena
getModelSpec, match, match, readOntology
Methods inherited from class de.uni_mannheim.informatik.dws.melt.matching_jena.MatcherYAAA
match
Methods inherited from class de.uni_mannheim.informatik.dws.melt.matching_base.MatcherFile
match
Methods inherited from class de.uni_mannheim.informatik.dws.melt.matching_base.MatcherURL
align, align, canExecute, getType
Methods inherited from class eu.sealsproject.platform.res.tool.impl.AbstractPlugin
getId, getVersion, setId, setVersion
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface eu.sealsproject.platform.res.tool.api.IPlugin
getId, getVersion
-
Field Details
-
LOGGER
private static final org.slf4j.Logger LOGGER -
extractor
-
modelName
-
trainingArguments
-
usingTensorflow
protected boolean usingTensorflow -
cudaVisibleDevices
-
transformersCache
-
multiProcessing
-
multipleTextsToMultipleExamples
protected boolean multipleTextsToMultipleExamples -
LOREM_IPSUM
-
SPLIT_WORDS
-
-
Constructor Details
-
TransformersBase
Constructor with all required parameters.
- Parameters:
extractor
- the extractor to select which text for each resource should be used.
modelName
- the model name which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer (see the first parameter pretrained_model_name_or_path of the from_pretrained function in the huggingface library). In case of a path, it should be absolute. The path can be generated by e.g. FileUtil.getCanonicalPathIfPossible(java.io.File).
-
TransformersBase
Constructor with all required parameters.
- Parameters:
extractor
- the extractor to select which text for each resource should be used.
modelName
- the model name which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer (see the first parameter pretrained_model_name_or_path of the from_pretrained function in the huggingface library). In case of a path, it should be absolute. The path can be generated by e.g. FileUtil.getCanonicalPathIfPossible(java.io.File).
-
-
Method Details
-
getExtractor
Returns the text extractor which extracts text from a given resource. This is the text which represents a resource.- Returns:
- the text extractor
-
getExtractorMap
Returns the text extractor map which extracts text from a given resource. This is the text which represents a resource.- Returns:
- the text extractor map
-
setExtractor
Sets the extractor which computes the text from a given resource. This is the text which represents a resource.- Parameters:
extractor
- the text extractor
-
setExtractorMap
Sets the extractor which computes the text from a given resource. This is the map variant which also includes the given keys in the map when MultipleTextsToMultipleExamples is set to true.- Parameters:
extractorMap
- the text extractor map
-
getModelName
Returns the model name which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer ( see first parameter pretrained_model_name_or_path of the from_pretrained function in huggingface library)- Returns:
- the model name as a string
-
setModelName
Sets the model name which can be a model id (a hosted model on huggingface.co) or a path to a directory containing a model and tokenizer (see the first parameter pretrained_model_name_or_path of the from_pretrained function in the huggingface library). In case of a path, it should be absolute. The path can be generated by e.g. FileUtil.getCanonicalPathIfPossible(java.io.File).
- Parameters:
modelName
- the model name as a string
-
getTrainingArguments
Returns the training arguments of the huggingface trainer. Any of the training arguments which are listed on the documentation can be used.- Returns:
- the training arguments of the huggingface trainer
-
setTrainingArguments
Sets the training arguments of the huggingface trainer. Any of the training arguments which are listed on the documentation can be used.- Parameters:
configuration
- the trainer configuration
-
addTrainingArgument
Adds a training argument for the transformers trainer. Any of the training arguments which are listed on the documentation can be used.- Parameters:
key
- The key of the training argument like warmup_ratiovalue
- the corresponding value like 0.2
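As a minimal self-contained sketch (not MELT's actual TransformersArguments class), the training-argument mechanism behaves like a key/value store whose entries are later handed to the huggingface trainer; the key "warmup_ratio" with value 0.2 is taken from the parameter description above, the class and getter names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in illustrating how addTrainingArgument collects
// huggingface trainer options as key/value pairs.
class TrainerArgsDemo {
    private final Map<String, Object> trainingArguments = new LinkedHashMap<>();

    // mirrors the documented signature addTrainingArgument(String key, Object value)
    void addTrainingArgument(String key, Object value) {
        trainingArguments.put(key, value);
    }

    // hypothetical accessor for the demo
    Object getTrainingArgument(String key) {
        return trainingArguments.get(key);
    }

    public static void main(String[] args) {
        TrainerArgsDemo matcher = new TrainerArgsDemo();
        // keys follow huggingface's TrainingArguments, e.g. warmup_ratio
        matcher.addTrainingArgument("warmup_ratio", 0.2);
        System.out.println(matcher.getTrainingArgument("warmup_ratio"));
    }
}
```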
-
isUsingTensorflow
public boolean isUsingTensorflow()Returns whether tensorflow is used to train the model. If true, the models are run with tensorflow. If false, pytorch is used.- Returns:
- true, if tensorflow is used. false, if pytorch is used.
-
setUsingTensorflow
public void setUsingTensorflow(boolean usingTensorflow) Sets whether tensorflow is used to train the model. If set to false, pytorch is used.- Parameters:
usingTensorflow
- true to use tensorflow and false to use pytorch.
-
getCudaVisibleDevicesButOnlyOneGPU
-
getCudaVisibleDevices
Returns a string which is set to the environment variable CUDA_VISIBLE_DEVICES to select on which GPU the process should run. If null or empty, the default is used (all available GPUs).- Returns:
- the variable CUDA_VISIBLE_DEVICES
-
setCudaVisibleDevices
Sets the environment variable CUDA_VISIBLE_DEVICES to select on which GPUs the process should run. If null or the string is empty, the default is used (all available GPUs). If multiple GPUs should be used, the values should be comma separated. Example: "0" to use only the first GPU; "1,3" to use the second and fourth GPU. The use of setCudaVisibleDevices(int...)
is preferred because it is more type safe.- Parameters:
cudaVisibleDevices
- the string which is set to the environment variable CUDA_VISIBLE_DEVICES
-
setCudaVisibleDevices
public void setCudaVisibleDevices(int... cudaVisibleDevices) Sets the environment variable CUDA_VISIBLE_DEVICES to select on which GPUs the process should run. If no values are provided, all available GPUs are used. If multiple GPUs should be used, provide the values one after the other. All indices are zero based, so call setCudaVisibleDevices(0, 1)
to use the first two GPUs.- Parameters:
cudaVisibleDevices
- the integer numbers which refer to the GPUs that should be used.
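The mapping from the varargs overload to the string form of CUDA_VISIBLE_DEVICES can be sketched as follows; this is an illustration of the documented behavior, not MELT's actual implementation:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class CudaDevicesDemo {
    // Illustration only: how zero-based GPU indices map to the
    // comma-separated value of CUDA_VISIBLE_DEVICES described above.
    static String toCudaVisibleDevices(int... cudaVisibleDevices) {
        return Arrays.stream(cudaVisibleDevices)
                .mapToObj(Integer::toString)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(toCudaVisibleDevices(0, 1)); // "0,1" -> first two GPUs
        System.out.println(toCudaVisibleDevices(1, 3)); // "1,3" -> second and fourth GPU
    }
}
```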
-
getTransformersCache
Returns the cache folder where the pretrained transformers models are stored. If set to null, the default location is used (which is usually ~/.cache/huggingface/transformers/).- Returns:
- the transformers cache folder.
-
setTransformersCache
Sets the cache folder where the pretrained transformers models are stored. If set to null, the default location is used (which is usually ~/.cache/huggingface/transformers/). This setter is useful if the default location does not have enough space available; in that case, just set it to a folder which has a lot of free space.- Parameters:
transformersCache
- The transformers cache folder.
-
getMultiProcessing
Returns the multiprocessing value of the transformer. The transformers library may not free all GPU memory. Thus the prediction and training are wrapped in an external process. This enum defines how the process is started and if multiprocessing should be used at all. The default is to use the system dependent default.- Returns:
- the enum which represents the multiprocess starting method.
-
setMultiProcessing
Sets the multiprocessing value of the transformer. The transformers library may not free all GPU memory. Thus the prediction and training are wrapped in an external process. This enum defines how the process is started and if multiprocessing should be used at all. The default is to use the system dependent default.- Parameters:
multiProcessing
- the enum which represents the multiprocess starting method.
-
setOptimizeForMixedPrecisionTraining
public void setOptimizeForMixedPrecisionTraining(boolean mpt) Enable or disable mixed precision training. This will optimize the runtime of training.- Parameters:
mpt
- true to enable mixed precision training
-
isOptimizeForMixedPrecisionTraining
public boolean isOptimizeForMixedPrecisionTraining()Returns whether mixed precision training is enabled or disabled.- Returns:
- true if mixed precision training is enabled.
-
isMultipleTextsToMultipleExamples
public boolean isMultipleTextsToMultipleExamples()Returns whether all texts returned by the text extractor are used separately to generate the examples. Otherwise all texts are concatenated together to form one example (the default). This should only be enabled when the extractor does not return many texts, because otherwise a lot of examples are produced.- Returns:
- true, if generation of multiple examples is enabled
-
setMultipleTextsToMultipleExamples
public void setMultipleTextsToMultipleExamples(boolean multipleTextsToMultipleExamples) If set to true, all texts returned by the text extractor are used separately to generate the examples. Otherwise all texts are concatenated together to form one example (the default). This should only be enabled when the extractor does not return many texts, because otherwise a lot of examples are produced.- Parameters:
multipleTextsToMultipleExamples
- true, to enable the generation of multiple examples.
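The effect of this flag can be illustrated with a small self-contained sketch (hypothetical helper, not MELT code): with the flag off, all extracted texts are concatenated into one example; with it on, each text becomes its own example, which is why extractors returning many texts can produce a lot of examples:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ExamplesDemo {
    // Illustration of multipleTextsToMultipleExamples:
    // false (the default) -> one concatenated example;
    // true -> one example per extracted text.
    static List<String> toExamples(List<String> texts, boolean multipleTextsToMultipleExamples) {
        if (multipleTextsToMultipleExamples) {
            return new ArrayList<>(texts);
        }
        return Collections.singletonList(String.join(" ", texts));
    }

    public static void main(String[] args) {
        List<String> texts = List.of("label of resource", "comment of resource");
        System.out.println(toExamples(texts, false)); // one concatenated example
        System.out.println(toExamples(texts, true));  // two separate examples
    }
}
```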
-
getTextualRepresentation
-
getExamplesForBatchSizeOptimization
protected List<String> getExamplesForBatchSizeOptimization(File trainingFile, int numberOfExamples, BatchSizeOptimization optimization) -
getExamplesForBatchSizeOptimizationGivenComparator
private static List<String> getExamplesForBatchSizeOptimizationGivenComparator(File trainingFile, int numberOfExamples, Comparator<List<String>> comparer) Creates examples for the batch size optimization which takes care of the CSV format (in case one entity is distributed over multiple lines).- Parameters:
trainingFile
- the training file to read from
numberOfExamples
- the number of examples to be returned
comparer
- the comparator (should fulfill the Comparator interface: -1 if the first is smaller than the second, etc.)- Returns:
- the largest elements in this file as a list of strings (these are already csv formatted).
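The selection idea behind this method can be sketched as follows, assuming rows are already split into strings; this hypothetical helper simply keeps the numberOfExamples largest rows by text length, whereas the real method additionally parses the CSV so that records spanning multiple lines stay together:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class BatchSizeExamplesDemo {
    // Hypothetical sketch: return the numberOfExamples largest rows
    // (by length), i.e. the elements most likely to stress the batch size.
    static List<String> largestExamples(List<String> rows, int numberOfExamples) {
        return rows.stream()
                .sorted(Comparator.comparingInt(String::length).reversed())
                .limit(numberOfExamples)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rows = List.of("short", "a much longer example row", "medium row");
        System.out.println(largestExamples(rows, 2));
    }
}
```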
-
createLoremIpsum
-
writeExamplesToFile
protected boolean writeExamplesToFile(List<String> list, File destination, int numberOfExamples) throws IOException - Throws:
IOException
-