java.lang.Object
de.uni_mannheim.informatik.dws.melt.matching_ml.python.PythonServer

public class PythonServer extends Object
A client class to communicate with python libraries such as gensim. This class follows a singleton pattern. Communication is performed through HTTP requests. If you need a different python environment or python executable, create a file named python_command.txt in the directory python_server and write the absolute path of the python executable into that file.
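As a minimal sketch of the configuration step described above, the following helper writes an interpreter path into python_command.txt. The directory and file names come from the description; the class PythonCommandFile, the choice of the working directory as base directory, and the example interpreter path are hypothetical and only serve as illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PythonCommandFile {

    /**
     * Writes the given python executable path to
     * baseDir/python_server/python_command.txt and returns the written file.
     * Assumption: the file contains nothing but the path itself.
     */
    public static Path write(Path baseDir, String pythonExecutable) throws IOException {
        Path dir = baseDir.resolve("python_server");
        Files.createDirectories(dir);
        Path file = dir.resolve("python_command.txt");
        Files.write(file, pythonExecutable.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical interpreter location; replace it with your own absolute path.
        Path written = write(Paths.get("."), "/opt/venvs/melt/bin/python");
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}
```

Where the python_server directory has to live depends on the resources directory in use, so the base directory passed to write may need to be adjusted.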
  • Field Details

    • LOGGER

      private static final org.slf4j.Logger LOGGER
      Default logger
    • DEFAULT_RESOURCES_DIRECTORY

      private static final String DEFAULT_RESOURCES_DIRECTORY
      Default resources directory (where the python files will be copied to by default) and where the resources are read from within the JAR.
    • JSON_MAPPER

      private static final com.fasterxml.jackson.databind.ObjectMapper JSON_MAPPER
      ObjectMapper from jackson to generate JSON.
    • serverUrl

      private static String serverUrl
      The URL that shall be used to perform the requests.
    • isVectorCaching

      private boolean isVectorCaching
      Indicator whether vectors shall be cached. This means that vectors are cached locally and similarities are calculated in Java to avoid many cross-language calls. Disable in cases of infrequent calls or if memory availability is limited.
    • isShutDown

      private static boolean isShutDown
      Indicates whether the server has been shut down. Initial state: shutDown.
    • vectorCache

      private HashMap<String,Double[]> vectorCache
      Local vector cache.
    • isHookStarted

      private boolean isHookStarted
      Indicates whether the shutdown hook has been initialized. This flag is required in order to have only one hook despite multiple re-initializations.
    • resourcesDirectory

      private File resourcesDirectory
      The directory where the python files will be copied to.
    • DEFAULT_PORT

      private static final int DEFAULT_PORT
      Developer note: Do not change the default port since other applications rely on it (e.g. the python tests). Instead, use setPort(int) if you need to change the port in certain cases.
    • port

      private static int port
      The port that shall be used.
    • pythonCommandBackup

      private static String pythonCommandBackup
      Allows the python command to be configured programmatically. The external file always takes precedence.
    • overridePythonFiles

      private static boolean overridePythonFiles
      If set to true, all python files (e.g. python server melt and requirements.txt file) will be overridden with every execution. Set it to false for testing and debugging new features in python server.
    • instance

      private static PythonServer instance
      Instance (singleton pattern).
    • httpClient

      private static org.apache.http.impl.client.CloseableHttpClient httpClient
      Client to communicate with the server.
    • serverProcess

      private static Process serverProcess
      The python process.
  • Constructor Details

    • PythonServer

      private PythonServer()
      Constructor
  • Method Details

    • transformersFineTuningHpSearch

      public void transformersFineTuningHpSearch(TransformersFineTunerHpSearch hpsearch, File trainingFile) throws PythonServerException
      Run a hyperparameter search for fine-tuning.
      Parameters:
      hpsearch - the hyper parameter search model to use
      trainingFile - path to csv file with three columns (text left, text right, label 1/0).
      Throws:
      PythonServerException - in case something goes wrong.
    • transformersFineTuning

      public void transformersFineTuning(TransformersFineTuner fineTuner, File trainingFile) throws PythonServerException
      Finetune a transformers model with the given parameters and write this model to a given folder.
      Parameters:
      fineTuner - the finetuner to use
      trainingFile - path to csv file with three columns (text left, text right, label 1/0).
      Throws:
      PythonServerException - in case something goes wrong.
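      The three-column training file (text left, text right, label 1/0) can be produced with plain Java. The sketch below is only an illustration: the class TrainingCsvSketch, the RFC 4180 style quoting, the absence of a header row, and the reading of label 1 as "same concept" are assumptions that the documentation above does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

public class TrainingCsvSketch {

    /** Quotes a field if it contains a comma, a double quote, or a line break. */
    public static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Builds one row: text left, text right, label (assumed: 1 = match, 0 = no match). */
    public static String row(String left, String right, boolean match) {
        return escape(left) + "," + escape(right) + "," + (match ? "1" : "0");
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add(row("heart", "cardiac muscle, organ", true));
        lines.add(row("heart", "liver", false));
        // Print the rows; in practice they would be written to the training file.
        lines.forEach(System.out::println);
    }
}
```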
    • transformersPrediction

      public List<Double> transformersPrediction(TransformersFilter filter, File predictionFilePath) throws PythonServerException
      Run a transformers model on a CSV file with two columns (text left and text right) to predict if they describe the same concept.
      Parameters:
      filter - the filter
      predictionFilePath - path to csv file with two columns (text left and text right).
      Returns:
      a list of confidences
      Throws:
      PythonServerException - in case something goes wrong.
    • sentenceTransformersPrediction

      public Alignment sentenceTransformersPrediction(SentenceTransformersMatcher matcher, File corpusFile, File queriesFile) throws PythonServerException
      Run sentence transformers prediction.
      Parameters:
      matcher - the matcher
      corpusFile - path to csv file with two columns (url, text representation).
      queriesFile - path to csv file with two columns (url, text representation).
      Returns:
      the newly generated alignment
      Throws:
      PythonServerException - in case something goes wrong.
    • sentenceTransformersFineTuning

      public float sentenceTransformersFineTuning(SentenceTransformersFineTuner fineTuner, File trainingFile, File validationFile) throws PythonServerException
      Run fine tuning for sentence transformers.
      Parameters:
      fineTuner - the fine tuner to use
      trainingFile - path to csv file with three columns (text left, text right, label 1/0).
      validationFile - the path to the validation file; can also be null to use a train/test split of the training file.
      Returns:
      the best score of the validation (using the file or train test split).
      Throws:
      PythonServerException - in case something goes wrong.
    • transformersFineTunerUpdateBaseRequest

      private void transformersFineTunerUpdateBaseRequest(TransformersBaseFineTuner fineTuner, File trainingFile, org.apache.http.client.methods.HttpGet request)
    • transformersUpdateBaseRequest

      private void transformersUpdateBaseRequest(TransformersBase base, org.apache.http.client.methods.HttpGet request)
    • runOpenEAModel

      public void runOpenEAModel(File argumentFile, boolean save) throws Exception
      Run the openEA library.
      Parameters:
      argumentFile - the argument file to use
      save - saves the embeddings to files
      Throws:
      Exception - in case something goes wrong.
    • learnAndApplyMLModel

      public List<Integer> learnAndApplyMLModel(File trainFile, File predictFile, int cv, int jobs) throws Exception
      Learn an ML model for a given training file. This file should be comma-separated and contain a header. The class attribute should be named "target".
      Parameters:
      trainFile - the train file
      predictFile - the file to predict
      cv - number of cross validations
      jobs - number of parallel jobs to run
      Returns:
      a list of integers (the predicted classes)
      Throws:
      Exception - throws exception in case of errors
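      Since the training file needs a header with a class attribute named "target", a quick check before sending the file to the server can catch mistakes early. This is a hypothetical helper (TargetHeaderCheck is not part of the library) and it assumes a simple header without quoted commas.

```java
public class TargetHeaderCheck {

    /**
     * Returns the zero-based index of the "target" column in a
     * comma-separated header line, or -1 if there is no such column.
     */
    public static int targetIndex(String headerLine) {
        String[] columns = headerLine.trim().split(",");
        for (int i = 0; i < columns.length; i++) {
            if (columns[i].trim().equals("target")) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical feature names; only "target" is prescribed by the documentation.
        String header = "confidence,labelSimilarity,target";
        System.out.println("Index of target column: " + targetIndex(header));
    }
}
```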
    • trainAndStoreMLModel

      public void trainAndStoreMLModel(File trainFile, File modelFile, int cv, int jobs) throws Exception
      Learn an ML model for a given training file and store it in the given model file. The training file should be comma-separated and contain a header. The class attribute should be named "target".
      Parameters:
      trainFile - the train file
      modelFile - where to store the model
      cv - number of cross validations
      jobs - number of parallel jobs to run
      Throws:
      Exception - throws exception in case of errors
    • applyStoredMLModel

      public List<Integer> applyStoredMLModel(File modelFile, File predictFile) throws Exception
      Apply a stored model to a new file (predict file).
      Parameters:
      modelFile - the file from which the stored model is loaded
      predictFile - the predict file
      Returns:
      a list of integers which represents the classes
      Throws:
      Exception - throws exception in case of errors
    • alignModel

      public Alignment alignModel(String vectorPathSource, String vectorPathTarget, String function, Alignment alignment) throws Exception
      Align two knowledge graph embeddings.
      Parameters:
      vectorPathSource - the source path to a vector file
      vectorPathTarget - the target path to a vector file
      function - function which is used to translate the embeddings
      alignment - the alignment with initial mapping
      Returns:
      alignment
      Throws:
      Exception - in case of errors
    • parseJSON

      private Alignment parseJSON(String resultString) throws Exception
      Throws:
      Exception
    • trainVectorSpaceModel

      public void trainVectorSpaceModel(String modelPath, String trainingFilePath)
      Method to train a vector space model. The file for the training (i.e., csv file where first column is id and second column text) has to exist already.
      Parameters:
      modelPath - identifier for the model (used for querying a specific model)
      trainingFilePath - The file path to the file that shall be used for training.
    • queryVectorSpaceModel

      public double queryVectorSpaceModel(String modelPath, String documentIdOne, String documentIdTwo) throws Exception
      Method to query a vector space model (which has to be trained with trainVectorSpaceModel).
      Parameters:
      modelPath - identifier for the model (used for querying a specific model)
      documentIdOne - Document id for the first document
      documentIdTwo - Document id for the second document
      Returns:
      The cosine similarity in the vector space between the two documents.
      Throws:
      Exception - Thrown if there are server problems.
    • queryVectorSpaceModel

      public List<Double> queryVectorSpaceModel(String modelPath, List<Correspondence> alignment) throws Exception
      Method to query a vector space model (which has to be trained with trainVectorSpaceModel) in a batch mode.
      Parameters:
      modelPath - identifier for the model (used for querying a specific model)
      alignment - the alignment which contains the source and target uris
      Returns:
      The cosine similarities in the vector space between the requested documents, in the same order.
      Throws:
      Exception - Thrown if there are server problems.
    • queryVectorSpaceModel

      public Alignment queryVectorSpaceModel(String modelPath, Alignment alignment) throws Exception
      Method to query a vector space model (which has to be trained with trainVectorSpaceModel) in a batch mode.
      Parameters:
      modelPath - identifier for the model (used for querying a specific model)
      alignment - the alignment which contains the source and target uris
      Returns:
      The alignment where the confidence is updated if possible
      Throws:
      Exception - Thrown if there are server problems.
    • trainDoc2VecModel

      public void trainDoc2VecModel(String modelPath, String trainingFilePath, Word2VecConfiguration configuration)
      Method to train a doc2vec model. The file for the training (i.e., csv file where first column is id and second column text) has to exist already.
      Parameters:
      modelPath - identifier for the model (used for querying a specific model)
      trainingFilePath - The file path to the file that shall be used for training.
      configuration - the configuration for the doc2vec model
    • queryDoc2VecModel

      public List<Double> queryDoc2VecModel(String modelPath, List<Correspondence> alignment) throws Exception
      Method to query a doc2vec model (which has to be trained with trainDoc2VecModel) in a batch mode.
      Parameters:
      modelPath - identifier for the model (used for querying a specific model)
      alignment - the alignment which contains the source and target uris
      Returns:
      The cosine similarities in the doc2vec space between the requested documents, in the same order.
      Throws:
      Exception - Thrown if there are server problems.
    • trainWord2VecModel

      public boolean trainWord2VecModel(String modelOrVectorPath, String trainingFilePath, Word2VecConfiguration configuration)
      Method to train a word2vec model. The file for the training (i.e., file with sentences, paths etc.) has to exist already.
      Parameters:
      modelOrVectorPath - If a vector file is desired, the file ending '.kv' is required.
      trainingFilePath - The file path to the file that shall be used for training or to the directory containing the files that shall be used.
      configuration - The configuration for the training operation.
      Returns:
      True if training succeeded, else false.
    • getSimilarity

      public double getSimilarity(String concept1, String concept2, String modelOrVectorPath)
      Get the similarity given two concepts and a gensim model.
      Parameters:
      concept1 - First concept.
      concept2 - Second concept.
      modelOrVectorPath - The path to the model or vector file. Note that the vector file MUST end with .kv in order to be recognized as vector file.
      Returns:
      -1.0 in case of failure, else similarity.
    • getVector

      public Double[] getVector(String concept, String modelOrVectorPath)
      Returns the vector of a concept.
      Parameters:
      concept - The concept for which the vector shall be obtained.
      modelOrVectorPath - The model path or vector file path leading to the file to be used.
      Returns:
      The vector for the specified concept.
    • isInVocabulary

      public boolean isInVocabulary(String concept, File modelOrVectorPath)
      Returns true when the concept can be found in the vocabulary of the model.
      Parameters:
      concept - The concept/URI that shall be looked up.
      modelOrVectorPath - The model or vector file. Note that the vector file MUST end with .kv in order to be recognized as vector file.
      Returns:
      True if exists, else false.
    • isInVocabulary

      public boolean isInVocabulary(String concept, String modelOrVectorPath)
      Returns true when the concept can be found in the vocabulary of the model.
      Parameters:
      concept - The concept/URI that shall be looked up.
      modelOrVectorPath - The path to the model or vector file. Note that the vector file MUST end with .kv in order to be recognized as vector file.
      Returns:
      True if exists, else false.
    • getVocabularyTerms

      public Set<String> getVocabularyTerms(String modelOrVectorPath)
      Returns the full vocabulary of the specified model as HashSet (e.g. for fast indexing). Be aware that this operation can be very memory-consuming for very large models.

      Note: If you just want to check whether a concept exists in the vocabulary, it is better to call isInVocabulary(String, String). Note further that you do not need to build your own cache if the PythonServer has vector caching enabled (you can check this with isVectorCaching()).

      Parameters:
      modelOrVectorPath - The path to the model or vector file. Note that the vector file MUST end with .kv in order to be recognized as vector file.
      Returns:
      Returns all vocabulary entries without vectors in a String HashSet.
    • writeVocabularyToFile

      public void writeVocabularyToFile(String modelOrVectorPath, String fileToWritePath)
      Writes the vocabulary of the given gensim model to a text file (UTF-8 encoded).
      Parameters:
      modelOrVectorPath - The model of which the vocabulary shall be obtained.
      fileToWritePath - The file path of the file that shall be written.
    • writeVocabularyToFile

      public void writeVocabularyToFile(String modelOrVectorPath, File fileToWrite)
      Writes the vocabulary of the given gensim model to a text file (UTF-8 encoded).
      Parameters:
      modelOrVectorPath - The model of which the vocabulary shall be obtained.
      fileToWrite - The file that shall be written.
    • writeSetToFile

      private static <T> void writeSetToFile(File fileToWrite, Set<T> setToWrite)
      This method writes the content of a Set to a file. The file will be UTF-8 encoded.
      Type Parameters:
      T - Type of the Set.
      Parameters:
      fileToWrite - File which will be created and in which the data will be written.
      setToWrite - Set whose content will be written into fileToWrite.
    • addModelToRequest

      private void addModelToRequest(org.apache.http.client.methods.HttpGet request, String modelOrVectorPath)
      Given a path to a model or vector file, this method determines whether it is a model or a vector file and adds the corresponding parameter to the request.
      Parameters:
      request - The request to which the model/vector file shall be added to.
      modelOrVectorPath - The path to the model/vector file.
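      The public methods state that a vector file MUST end with .kv to be recognized as such, which suggests a simple suffix check. The following is a minimal sketch of that rule only; the class ModelPathKind and its enum are hypothetical, and the actual request parameters used by this private method are not documented here.

```java
public class ModelPathKind {

    /** The two kinds of files a path can refer to. */
    public enum Kind { MODEL, VECTOR_FILE }

    /** Classifies a path by its file ending: ".kv" means vector file, anything else a model. */
    public static Kind classify(String modelOrVectorPath) {
        return modelOrVectorPath.endsWith(".kv") ? Kind.VECTOR_FILE : Kind.MODEL;
    }

    public static void main(String[] args) {
        // Hypothetical file names for illustration.
        System.out.println(classify("embeddings.kv"));
        System.out.println(classify("word2vec.model"));
    }
}
```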
    • getCanonicalPath

      private String getCanonicalPath(String filePath)
      Obtain the canonical model path.
      Parameters:
      filePath - The path to the gensim model or gensim vector file.
      Returns:
      The canonical model path as String.
    • getCanonicalPath

      private String getCanonicalPath(File file)
      Obtain the canonical model path.
      Parameters:
      file - the file to get the canonical path from
      Returns:
      The canonical path as String.
    • runGroupShuffleSplit

      public List<Integer> runGroupShuffleSplit(List<Integer> groups, double trainSize) throws Exception
      Throws:
      Exception
    • printHello

      private void printHello(String name)
      A quick technical demo. If the service works, it will print "Hello name".
      Parameters:
      name - The name that shall be printed.
    • runRequest

      private String runRequest(org.apache.http.client.methods.HttpUriRequest request) throws PythonServerException
      Throws:
      PythonServerException
    • getInstance

      public static PythonServer getInstance()
      Get the instance.
      Returns:
      The PythonServer instance.
    • getInstance

      public static PythonServer getInstance(File resourcesDirectory)
      Get the instance (singleton pattern).
      Parameters:
      resourcesDirectory - Directory where the files shall be copied to.
      Returns:
      The PythonServer instance.
    • checkRequirements

      public static boolean checkRequirements()
      Checks whether all Python requirements are installed and whether the server is functional.
      Returns:
      True if the server is fully functional, else false.
    • shutDown

      public static void shutDown()
      Shut down the service.
    • exportResource

      private void exportResource(File baseDirectory, String resourceName)
      Export a resource embedded into a Jar file to the local file path.
      Parameters:
      baseDirectory - The base directory.
      resourceName - The name of the resource, e.g. "/SmartLibrary.dll"
    • startServer

      private boolean startServer()
      Initializes the server.
      Returns:
      True if successful, else false.
    • getLogLevel

      private String getLogLevel()
    • getPythonCommand

      protected String getPythonCommand()
      Returns the python command which is extracted from file melt-resources/python_command.txt.
      Returns:
      The python executable path.
    • updateEnvironmentPath

      protected void updateEnvironmentPath(Map<String,String> environment, String pythonCommand)
      Updates the environment variable PATH with additional directories needed by python, such as env/lib/bin.
      Parameters:
      environment - The environment to be changed.
      pythonCommand - The python executable path.
    • getPythonAdditionalPath

      protected String getPythonAdditionalPath(String pythonCommand)
      Returns a concatenated path of directories which can be used in the PATH variable. Starting from the python executable path, it searches for all bin directories within the python directory.
      Parameters:
      pythonCommand - The python executable path.
      Returns:
      a concatenated path of directories which can be used in the PATH variable.
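      To illustrate what concatenating directories for PATH can look like, here is a small sketch under stated assumptions. The class PathUpdateSketch is hypothetical, the directories are made up, and whether the library prepends or appends to PATH is not documented above; the sketch prepends.

```java
import java.io.File;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PathUpdateSketch {

    /** Joins directories with the platform-specific path separator (":" or ";"). */
    public static String join(List<String> directories) {
        return String.join(File.pathSeparator, directories);
    }

    /**
     * Prepends additionalPath to the PATH entry of the given environment map,
     * e.g. the map returned by ProcessBuilder.environment().
     * Note: on Windows the variable may be stored under the key "Path".
     */
    public static void prependToPath(Map<String, String> environment, String additionalPath) {
        if (additionalPath == null || additionalPath.isEmpty()) {
            return;
        }
        String current = environment.get("PATH");
        if (current == null || current.isEmpty()) {
            environment.put("PATH", additionalPath);
        } else {
            environment.put("PATH", additionalPath + File.pathSeparator + current);
        }
    }

    public static void main(String[] args) {
        Map<String, String> environment = new HashMap<>();
        environment.put("PATH", "/usr/bin");
        // Hypothetical directories next to a python executable.
        prependToPath(environment, join(Arrays.asList("/opt/env/bin", "/opt/env/lib/bin")));
        System.out.println(environment.get("PATH"));
    }
}
```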
    • cosineSimilarity

      public static double cosineSimilarity(Double[] vector1, Double[] vector2)
      Calculate the cosine similarity between two vectors.
      Parameters:
      vector1 - First vector.
      vector2 - Second vector.
      Returns:
      Cosine similarity as double.
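      Cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms. The standalone sketch below shows that computation on Double[] arrays; it is not the library's implementation, and the class CosineSketch as well as the handling of mismatched lengths are assumptions.

```java
public class CosineSketch {

    /**
     * Computes dot(v1, v2) / (|v1| * |v2|).
     * Returns NaN if one of the vectors has zero length in the Euclidean sense.
     */
    public static double cosine(Double[] v1, Double[] v2) {
        if (v1.length != v2.length) {
            throw new IllegalArgumentException("Vectors must have the same length.");
        }
        double dot = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;
        for (int i = 0; i < v1.length; i++) {
            dot += v1[i] * v2[i];
            norm1 += v1[i] * v1[i];
            norm2 += v2[i] * v2[i];
        }
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }

    public static void main(String[] args) {
        Double[] a = {1.0, 0.0};
        Double[] b = {0.0, 1.0};
        System.out.println(cosine(a, a)); // identical direction
        System.out.println(cosine(a, b)); // orthogonal vectors
    }
}
```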
    • writeModelAsTextFile

      public void writeModelAsTextFile(String modelOrVectorPath, String fileToWrite)
      Writes the vectors to a human-readable text file.
      Parameters:
      modelOrVectorPath - The path to the model or vector file. Note that the vector file MUST end with .kv in order to be recognized as vector file.
      fileToWrite - The file that will be written.
    • writeModelAsTextFile

      public void writeModelAsTextFile(String modelOrVectorPath, String fileToWrite, String entityFile)
      Writes the vectors to a human-readable text file.
      Parameters:
      modelOrVectorPath - The path to the model or vector file. Note that the vector file MUST end with .kv in order to be recognized as vector file.
      fileToWrite - The file that will be written.
      entityFile - The vocabulary that shall appear in the text file (can be null if all words shall be written). The file must contain one word per line. The contents must be a subset of the vocabulary.
    • getResourcesDirectory

      public File getResourcesDirectory()
    • setPythonCommandBackup

      public static void setPythonCommandBackup(String pythonCommandBackup)
      Sets the python command programmatically. This is used when no external file python_command.txt is found.
      Parameters:
      pythonCommandBackup - the python command.
    • setOverridePythonFiles

      public static void setOverridePythonFiles(boolean overrideFiles)
      If set to true, all python files (e.g. python server melt and requirements.txt file) will be overridden with every execution. If you want to make changes to the python server (e.g. to develop and test a feature), you can set it to false. Your modifications to these files will then not be overwritten.
      Parameters:
      overrideFiles - if true, override the python server files.
    • getResourcesDirectoryPath

      public String getResourcesDirectoryPath()
      Get the resource directory as String.
      Returns:
      Directory as String.
    • setResourcesDirectory

      public void setResourcesDirectory(File resourcesDirectory)
      Set the directory where the python files will be copied to.
      Parameters:
      resourcesDirectory - Must be a directory.
    • getVocabularySize

      public int getVocabularySize(String modelOrVectorPath)
      Returns the size of the vocabulary of the stated model/vector set.
      Parameters:
      modelOrVectorPath - The path to the model or vector file. Note that the vector file MUST end with .kv in order to be recognized as vector file.
      Returns:
      -1 in case of an error else the size of the vocabulary.
    • isVectorCaching

      public boolean isVectorCaching()
      Indicates whether vector caching is enabled.
      Returns:
      True if enabled, else false.
    • setVectorCaching

      public void setVectorCaching(boolean vectorCaching)
      If vector caching is turned on, similarities will be calculated on the Java side (rather than in Python) and vectors are held in memory. Turn this function on if you plan to do many computations with the same set of vectors. This will increase the performance at the cost of memory.
      Parameters:
      vectorCaching - True if caching shall be enabled, else false.
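      The trade-off described above can be pictured with a small cache in front of an expensive lookup. This is a sketch under stated assumptions, not the class's actual caching code: VectorCacheSketch is hypothetical, the lookup function stands in for the HTTP call to the python server, and the cache key here is the concept alone.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class VectorCacheSketch {

    private final Map<String, Double[]> cache = new HashMap<>();
    private final Function<String, Double[]> remoteLookup;
    private int remoteCalls = 0;

    /** remoteLookup stands in for the cross-language call that fetches a vector. */
    public VectorCacheSketch(Function<String, Double[]> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    /** Returns the cached vector if present; otherwise performs one lookup and caches the result. */
    public Double[] getVector(String concept) {
        Double[] cached = cache.get(concept);
        if (cached != null) {
            return cached;
        }
        remoteCalls++;
        Double[] vector = remoteLookup.apply(concept);
        if (vector != null) {
            cache.put(concept, vector);
        }
        return vector;
    }

    /** Number of lookups that actually reached the (simulated) server. */
    public int getRemoteCalls() {
        return remoteCalls;
    }

    public static void main(String[] args) {
        // Simulated server that returns a fixed vector for every concept.
        VectorCacheSketch cache = new VectorCacheSketch(concept -> new Double[]{1.0, 2.0});
        cache.getVector("heart");
        cache.getVector("heart"); // served from the cache
        cache.getVector("liver");
        System.out.println("Remote calls: " + cache.getRemoteCalls());
    }
}
```

      Each cached vector stays in memory until the cache is dropped, which is why caching pays off for repeated computations on the same vectors and costs memory otherwise.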
    • getPort

      public static int getPort()
    • setPort

      public static void setPort(int port)
    • getServerUrl

      public static String getServerUrl()