java.lang.Object
de.uni_mannheim.informatik.dws.melt.matching_jena_matchers.external.services.stringOperations.StringOperations

public class StringOperations extends Object
A helper class for string operations.
  • Field Details

    • LOGGER

      private static final org.slf4j.Logger LOGGER
    • separatingWords

      private static final HashSet<String> separatingWords
    • stopwords

      private static HashSet<String> stopwords
    • PATH_TO_STOPWORD_FILE

      private static final String PATH_TO_STOPWORD_FILE
    • PATH_TO_STOPWORD_FILE_JAR

      private static final String PATH_TO_STOPWORD_FILE_JAR
    • ENGLISH_NUMBER_WORDS_SET

      public static HashSet<String> ENGLISH_NUMBER_WORDS_SET
      A set containing ordinal and cardinal number words from 1 to 10.
  • Constructor Details

    • StringOperations

      public StringOperations()
  • Method Details

    • isCamelCase

      public static boolean isCamelCase(String phrase)
      Function which indicates whether a phrase is in camel case or not.
      Parameters:
      phrase - The phrase to be checked.
      Returns:
      True if phrase is in camel case, else false.
    • isUnderscoreCase

      public static boolean isUnderscoreCase(String phrase)
      Function which indicates whether a phrase is in underscore case or not.
      Parameters:
      phrase - The phrase to be checked.
      Returns:
      True if phrase is in underscore case, else false.
    • isSpaceCase

      public static boolean isSpaceCase(String phrase)
      Function which indicates whether a phrase is space separated or not.
      Parameters:
      phrase - The phrase to be checked.
      Returns:
      True if space-separated, else false.
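
      The three casing checks above can be illustrated with a minimal sketch. The class name CasingSketch and the exact rules (a space anywhere, an underscore anywhere, a lower-case letter directly followed by an upper-case letter) are assumptions made for illustration; the library's actual criteria may differ.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for the three casing checks; not the library's code. */
class CasingSketch {

    // Assumption: camel case shows up as a lower-case letter directly
    // followed by an upper-case letter, e.g. "sA" in "hasAuthor".
    private static final Pattern LOWER_THEN_UPPER = Pattern.compile("\\p{Ll}\\p{Lu}");

    /** Assumption: a phrase is space separated if it contains an inner space. */
    static boolean isSpaceCase(String phrase) {
        return phrase != null && phrase.trim().contains(" ");
    }

    /** Assumption: a phrase is in underscore case if it contains an underscore. */
    static boolean isUnderscoreCase(String phrase) {
        return phrase != null && phrase.contains("_");
    }

    /** Assumption: camel case excludes spaces and underscores. */
    static boolean isCamelCase(String phrase) {
        return phrase != null
                && !phrase.contains(" ")
                && !phrase.contains("_")
                && LOWER_THEN_UPPER.matcher(phrase).find();
    }
}
```

      Under these assumptions, "hasAuthor" counts as camel case, "has_author" as underscore case, and "has author" as space separated.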
    • tokenizeCamelCase

      public static String[] tokenizeCamelCase(String phrase, StringOperations.AbbreviationHandler handler)
      Given a camel cased String, this method will split it into multiple tokens.
      Parameters:
      phrase - The phrase to be tokenized.
      handler - Determines how to handle abbreviations.
      Returns:
      The tokens of the phrase.
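
      A rough idea of camel case tokenization can be given with a regular expression. This sketch does not reproduce the semantics of StringOperations.AbbreviationHandler; it resolves abbreviations in one fixed way, and the class name CamelCaseTokenizerSketch is hypothetical.

```java
/** Hypothetical camel case tokenizer; not the library's implementation. */
class CamelCaseTokenizerSketch {

    /**
     * Splits at two kinds of zero-width positions:
     * 1. between a lower-case and an upper-case letter ("has|Author"),
     * 2. between an upper-case run and an upper-case letter that starts
     *    a lower-case word ("URL|String").
     * Assumption: the phrase contains no spaces or underscores.
     */
    static String[] tokenize(String phrase) {
        return phrase.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}\\p{Ll})");
    }
}
```

      With this pattern, "parseURLString" is split into "parse", "URL", and "String". An abbreviation at the very end of a phrase, or one followed by a lower-case word without a capital, would need separate handling.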
    • tokenizeSpaceCase

      public static String[] tokenizeSpaceCase(String phrase)
      Tokenizes the phrase using spaces.
      Parameters:
      phrase - The phrase to be tokenized.
      Returns:
      The tokens of the phrase.
    • tokenizeUnderScoreCase

      public static String[] tokenizeUnderScoreCase(String phrase)
      Tokenizes the phrase using underscores.
      Parameters:
      phrase - The phrase to be tokenized.
      Returns:
      The tokens of the phrase.
    • printStringArray

      public static void printStringArray(String[] stringArray)
      A method which prints the content of a string array to the command line.
      Parameters:
      stringArray - Array to be printed.
    • tokenizeBestGuess

      public static String[] tokenizeBestGuess(String phrase, StringOperations.AbbreviationHandler handler)
      Given an arbitrary phrase, the method determines which casing is used and applies the suitable tokenizer. The tokenizer is not very aggressive: a '-', for instance, will not be used as a splitter. For camel-cased phrases with abbreviations, all combinations are determined if no handler is defined.
      Parameters:
      phrase - The phrase to be tokenized.
      handler - The handler which determines how abbreviations shall be handled.
      Returns:
      Tokens.
    • tokenizeWithoutCamelCaseRecognition

      private static String[] tokenizeWithoutCamelCaseRecognition(String phrase)
      Split using slash, underscore and space.
      Parameters:
      phrase - Phrase to be split.
      Returns:
      Array of individual tokens.
    • tokenizeCamelCaseAndSlash

      public static String[] tokenizeCamelCaseAndSlash(String phrase, StringOperations.AbbreviationHandler handler)
      Tokenizes the phrase, using camel case boundaries and slashes as delimiters.
      Parameters:
      phrase - The phrase to be tokenized.
      handler - Abbreviation handler.
      Returns:
      String array of tokens.
    • tokenizeBestGuess

      public static String[] tokenizeBestGuess(String phrase)
      Given an arbitrary phrase, the method determines which casing is used and applies the suitable tokenizer. For camel-cased phrases with abbreviations, it is assumed that an upper-case letter follows an abbreviation.
      Parameters:
      phrase - The phrase to be tokenized.
      Returns:
      Tokens.
    • getNumberOfTokensBestGuess

      public static int getNumberOfTokensBestGuess(String phrase, StringOperations.AbbreviationHandler handler)
      Returns the number of tokens that were found in a phrase.
      Parameters:
      phrase - The phrase to be checked.
      handler - Defines the handling of abbreviations. Note that AbbreviationHandler.CONSIDER_ALL leads to more tokens than actually exist because combinations are employed.
      Returns:
      Number of tokens.
    • getNumberOfTokensBestGuess

      public static int getNumberOfTokensBestGuess(String phrase)
      Returns the number of tokens that were found in a phrase. Note that the number of tokens is obtained using AbbreviationHandler.UPPER_CASE_FOLLOWS_ABBREVIATION in the default case. Note further that stopword removal is not taken into account. Be careful when mixing with stopword removal.
      Parameters:
      phrase - The phrase that shall be checked.
      Returns:
      The number of tokens.
    • containsSplitWords

      public static boolean containsSplitWords(String phrase)
      Parameters:
      phrase - The phrase to be checked.
      Returns:
      True if the phrase contains split words.
    • containsSplitWords

      public static boolean containsSplitWords(String[] phraseTokens)
      Parameters:
      phraseTokens - The tokens that shall be processed.
      Returns:
      True if the tokens contain split words, else false.
    • splitUsingSplitWords

      public static String[] splitUsingSplitWords(String[] phraseTokens)
    • concatArray

      private static String concatArray(String[] array)
      Concatenates a string array to one string separated by spaces.
      Parameters:
      array - Array that shall be concatenated.
      Returns:
      Concatenated array as String.
    • cleanStringForDBpediaQuery

      public static String cleanStringForDBpediaQuery(String inputString)
      This method removes illegal characters of a string when used in a SPARQL query.
      Parameters:
      inputString - Input String.
      Returns:
      Edited String.
    • reduceToLettersOnly

      public static String reduceToLettersOnly(String string)
      Cleans a string from anything that is not a letter.
      Parameters:
      string - String to be cleaned.
      Returns:
      Cleaned String.
    • writeSetToFile

      public static <T> void writeSetToFile(File fileToWrite, Set<T> setToWrite)
      This method writes the content of a Set to a file. The file will be UTF-8 encoded.
      Type Parameters:
      T - Type of the Set.
      Parameters:
      fileToWrite - File which will be created and in which the data will be written.
      setToWrite - Set whose content will be written into fileToWrite.
    • readSetFromFile

      @NotNull public static @NotNull Set<String> readSetFromFile(String filePath)
      Reads a Set from the file as specified by the file path.
      Parameters:
      filePath - The path to the file that is to be read.
      Returns:
      The parsed file as Set.
    • readSetFromFile

      @NotNull public static @NotNull Set<String> readSetFromFile(File file)
      Reads a Set from the file as specified by the file.
      Parameters:
      file - The file that is to be read.
      Returns:
      The parsed file as Set.
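
      Writing a set to a UTF-8 file and reading it back can be sketched with java.nio. The one-element-per-line layout, the use of String.valueOf for each element, and the class name SetFileSketch are assumptions; the documentation above only states that the file is UTF-8 encoded.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical round trip of a Set through a UTF-8 text file. */
class SetFileSketch {

    /** Assumption: every element is written on its own line. */
    static <T> void write(Path file, Set<T> set) {
        List<String> lines = set.stream().map(String::valueOf).collect(Collectors.toList());
        try {
            Files.write(file, lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the file line by line; duplicate lines collapse into one entry. */
    static Set<String> read(Path file) {
        try {
            return new LinkedHashSet<>(Files.readAllLines(file, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

      Because elements are written via String.valueOf, a Set of non-String elements comes back as a Set of their string representations, which is consistent with the Set<String> return type of readSetFromFile.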
    • readListFromFile

      @NotNull public static @NotNull List<String> readListFromFile(String filePath)
      Reads a List from the file as specified by the file path.
      Parameters:
      filePath - The path to the file that is to be read.
      Returns:
      The parsed file as List.
    • readListFromFile

      @NotNull public static @NotNull List<String> readListFromFile(File file)
      Reads a List from the file as specified by the file.
      Parameters:
      file - The file that is to be read.
      Returns:
      The parsed file as List.
    • convertToTag

      public static String convertToTag(String stringToConvert)
      Converts a string to a tag. Example: "Hagrid" will be converted to "<Hagrid>". If the string is already a tag, it will be returned as is.
      Parameters:
      stringToConvert - The String which shall be converted to a tag.
      Returns:
      The String as tag.
    • removeTag

      public static String removeTag(String tagToConvert)
      Removes the angle brackets of a tag. Example: "<Hagrid>" will be converted to "Hagrid".
      Parameters:
      tagToConvert - The tag which shall be converted.
      Returns:
      The string as non-tag.
    • addTagIfNotExists

      public static String addTagIfNotExists(String addTagString)
      Adds tags if they are not there yet. "<Hagrid>" will be converted to "<Hagrid>", "Hagrid" will be converted to "<Hagrid>", "<Hagrid" will be converted to "<Hagrid>" etc.
      Parameters:
      addTagString - String to which tags shall be added.
      Returns:
      Tagged string.
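
      The tag examples given for convertToTag, removeTag, and addTagIfNotExists can be reproduced with a few prefix and suffix checks. The class TagSketch is a hypothetical illustration and makes no claim about how the library handles null or empty input.

```java
/** Hypothetical tag helpers mirroring the documented examples. */
class TagSketch {

    /** Adds a leading "<" and a trailing ">" only where they are missing. */
    static String addTagIfNotExists(String s) {
        String result = s;
        if (!result.startsWith("<")) {
            result = "<" + result;
        }
        if (!result.endsWith(">")) {
            result = result + ">";
        }
        return result;
    }

    /** Strips one leading "<" and one trailing ">" if present. */
    static String removeTag(String tag) {
        String result = tag;
        if (result.startsWith("<")) {
            result = result.substring(1);
        }
        if (result.endsWith(">")) {
            result = result.substring(0, result.length() - 1);
        }
        return result;
    }
}
```

      In this sketch, "Hagrid", "<Hagrid", and "<Hagrid>" all map to "<Hagrid>", and removeTag turns "<Hagrid>" back into "Hagrid".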
    • removeEnglishPlural

      public static String removeEnglishPlural(String stringToBeModified)
      Remove the plural in English words.
      Parameters:
      stringToBeModified - The string that shall be modified.
      Returns:
      Modified string.
    • removeLanguageAnnotation

      public static String removeLanguageAnnotation(String s)
      Removes the language annotation from a string. If the string does not have a language annotation, the string will be returned unchanged. Example: "Hagrid@en" will be changed to "Hagrid".
      Parameters:
      s - String to be changed.
      Returns:
      String without language annotation.
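
      One way to strip a trailing language annotation is a regular expression anchored at the end of the string. The accepted tag shape (two or three letters, optionally followed by hyphenated subtags) and the class name LanguageAnnotationSketch are assumptions for illustration.

```java
import java.util.regex.Pattern;

/** Hypothetical removal of a trailing language annotation such as "@en". */
class LanguageAnnotationSketch {

    // Assumption: "@" plus a 2-3 letter language code and optional
    // subtags ("@en", "@en-GB") at the very end of the string.
    private static final Pattern LANG_SUFFIX =
            Pattern.compile("@[A-Za-z]{2,3}(-[A-Za-z0-9]+)*$");

    /** Returns the input unchanged if no annotation is found. */
    static String remove(String s) {
        return LANG_SUFFIX.matcher(s).replaceFirst("");
    }
}
```

      The end anchor keeps strings such as e-mail addresses intact, because the part after "@" must consist of a short language code only.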
    • cleanValueFromTypeAnnotation

      public static String cleanValueFromTypeAnnotation(String valueToClean)
      Cleans a value from a type annotation. Example: "0.816318^^http://www.w3.org/2001/XMLSchema#float" will be cleaned to "0.816318".
      Parameters:
      valueToClean - The value that shall be cleaned.
      Returns:
      The cleaned value as String.
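
      The documented example can be reproduced by cutting the string at the first "^^" separator. The class name TypeAnnotationSketch is hypothetical, and returning the input unchanged when no separator is present is an assumption.

```java
/** Hypothetical removal of a datatype annotation introduced by "^^". */
class TypeAnnotationSketch {

    /** Keeps everything before the first "^^"; assumption: no "^^" means no annotation. */
    static String clean(String valueToClean) {
        int separator = valueToClean.indexOf("^^");
        return separator >= 0 ? valueToClean.substring(0, separator) : valueToClean;
    }
}
```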
    • isSameStringStemming

      public static boolean isSameStringStemming(String s1, String s2)
      This method checks whether two Strings are very similar by performing simple string operations including Porter's stemmer.
      Parameters:
      s1 - String 1.
      s2 - String 2.
      Returns:
      boolean
    • isSameString

      public static boolean isSameString(String s1, String s2)
      This method checks whether two Strings are very similar by performing simple string operations. Stopwords are retained.
      Parameters:
      s1 - String 1
      s2 - String 2
      Returns:
      boolean
    • isSameStringIgnoringStopwordsAndNumbersWithSpellingCorrection

      public static boolean isSameStringIgnoringStopwordsAndNumbersWithSpellingCorrection(String s1, String s2, float maxAllowedEditDistance)
    • hasSimilarTokenWriting

      public static boolean hasSimilarTokenWriting(String[] sarray1, String[] sarray2, float tolerance)
      Checks whether two arrays have a similar writing. Every token is matched to its most similar token. Tokens can be used multiple times.
      Parameters:
      sarray1 - Array 1
      sarray2 - Array 2
      tolerance - The minimal tolerance that is allowed.
      Returns:
      True if the distance is less than or equal to the allowed distance.
    • getLevenshteinDistanceSimilarTokensOneWay

      public static float getLevenshteinDistanceSimilarTokensOneWay(String[] sarray1, String[] sarray2)
      Returns the Levenshtein distance between two token sets. This is only a one-way test: if sarray2 contains all tokens of sarray1, then the distance will be 0 even though sarray2 might contain additional tokens that are not contained in sarray1. Tokens can be used multiple times.
      Parameters:
      sarray1 - Array 1
      sarray2 - Array 2
      Returns:
      Distance as float.
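
      The one-way behaviour described above can be made concrete with a small sketch. How the library aggregates the per-token distances is not stated here; summing the minimal distances is an assumption, as are the class and method names.

```java
/** Hypothetical one-way token distance based on Levenshtein edit distance. */
class OneWayTokenDistanceSketch {

    /** Classic dynamic-programming Levenshtein distance using two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * For every token in 'from', finds the closest token in 'to' and sums
     * those minimal distances (assumption). Tokens in 'to' may be reused,
     * and extra tokens in 'to' do not increase the result.
     */
    static float oneWayDistance(String[] from, String[] to) {
        float total = 0f;
        for (String token : from) {
            // Assumption: with nothing to compare against, the token
            // counts with its full length.
            int best = to.length == 0 ? token.length() : Integer.MAX_VALUE;
            for (String candidate : to) {
                best = Math.min(best, levenshtein(token, candidate));
            }
            total += best;
        }
        return total;
    }
}
```

      Comparing {"harry", "potter"} against {"potter", "harry", "books"} yields 0 in this sketch, while the reverse direction yields a positive value because "books" has no exact counterpart.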
    • isSameStringIgnoringStopwords

      public static boolean isSameStringIgnoringStopwords(String s1, String s2)
      This method checks whether two Strings are very similar by performing simple string operations. Stopwords are removed.
      Parameters:
      s1 - String 1
      s2 - String 2
      Returns:
      boolean
    • isSameStringIgnoringStopwordsAndNumbers

      public static boolean isSameStringIgnoringStopwordsAndNumbers(String s1, String s2)
      This method checks whether two Strings are very similar by performing simple string operations. Stopwords and numbers are removed.
      Parameters:
      s1 - String 1
      s2 - String 2
      Returns:
      boolean
    • clearArrayFromStopwords

      public static String[] clearArrayFromStopwords(String[] arrayWithStopwords)
      Returns an array cleaned from stopwords. Retains the ordering.
      Parameters:
      arrayWithStopwords - Array with stopwords.
      Returns:
      Array without stopwords.
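
      Order-preserving stopword removal can be sketched as a simple filter. The stopword list below is a tiny hypothetical stand-in for the contents of the stopword file, and the case-insensitive comparison is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stopword filter that keeps the original token order. */
class StopwordSketch {

    // Hypothetical list; the library reads its stopwords from a file.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    /** Returns a new array without stopwords; assumption: matching ignores case. */
    static String[] clear(String[] arrayWithStopwords) {
        return Arrays.stream(arrayWithStopwords)
                .filter(token -> !STOPWORDS.contains(token.toLowerCase(Locale.ROOT)))
                .toArray(String[]::new);
    }
}
```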
    • clearHashSetFromStopwords

      public static HashSet<String> clearHashSetFromStopwords(HashSet<String> hashSetWithStopwords)
      Removes the stopwords from the given HashSet.
      Parameters:
      hashSetWithStopwords - HashSet from which the stopwords shall be removed.
      Returns:
      Cleared HashSet
    • removeEnglishGenitiveS

      public static String[] removeEnglishGenitiveS(String[] array)
      Removes free floating "s", "S", and cuts "'s".
      Parameters:
      array - Array to be transformed.
      Returns:
      New array.
    • removeEnglishGenitiveS

      public static HashSet<String> removeEnglishGenitiveS(HashSet<String> set)
      Remove free floating s from the given set.
      Parameters:
      set - Set from which s shall be removed.
      Returns:
      Set with removed s/S.
    • stemPorter

      public static String stemPorter(String word)
      Wrapping of Porter's Stemming Code.
      Parameters:
      word - Word to be stemmed.
      Returns:
      Stemmed word.
    • lazyInitStopwords

      private static void lazyInitStopwords()
      Initialize reading stopwords file if it has not been read before.
    • initStopwords

      public static void initStopwords()
      Initialize reading stopwords.
    • isMeaningfulFragment

      public static boolean isMeaningfulFragment(String fragment)
      Checks whether a fragment is meaningful by counting the number of digits.
      Parameters:
      fragment - The fragment for which relevance shall be checked.
      Returns:
      Returns false if at least half of the fragment is composed of digits.
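
      The digit-counting rule can be sketched in a few lines. Treating null and empty fragments as not meaningful is an assumption, and the class name FragmentSketch is hypothetical.

```java
/** Hypothetical digit-ratio check for URI fragments and similar strings. */
class FragmentSketch {

    /** Returns false if at least half of the characters are digits. */
    static boolean isMeaningful(String fragment) {
        if (fragment == null || fragment.isEmpty()) {
            return false; // assumption: nothing to judge
        }
        long digits = fragment.chars().filter(Character::isDigit).count();
        // "At least half digits" means digits * 2 >= length.
        return digits * 2 < fragment.length();
    }
}
```

      Under this rule, "Person" is meaningful, whereas "Q42" (two of three characters are digits) and "ab12" (exactly half) are not.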
    • addAlternativeWritingsSimple

      public static HashSet<String> addAlternativeWritingsSimple(HashSet<String> set)
      Generate alternative writings (particularly interesting for English and German hyphenation).
      Parameters:
      set - The set which shall be processed.
      Returns:
      The new set with alternative writings.
    • removeNumbers

      public static HashSet<String> removeNumbers(HashSet<String> set)
      Remove numbers from a set of strings.
      Parameters:
      set - Set from which numbers shall be removed.
      Returns:
      A new set with no number instances.
    • clearArrayFromNumbers

      public static String[] clearArrayFromNumbers(String[] array)
      Given a String array, numeric tokens will be removed.
      Parameters:
      array - The array from which numeric components shall be removed.
      Returns:
      The new array will be of smaller length while the order of tokens will be retained.
    • isNaturalNumber

      public static boolean isNaturalNumber(String stringToBeChecked)
      Returns whether the stringToBeChecked is a number, e.g. '123' or 'XI'. For reasons of performance, the syntax of Roman numerals is not checked.
      Parameters:
      stringToBeChecked - The string that shall be checked for numeric properties.
      Returns:
      True if roman or arabic number, else false.
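
      A check that accepts Arabic digits or Roman numeral letters without validating Roman syntax might look as follows. Restricting the Roman check to upper-case letters and the class name NumberSketch are assumptions.

```java
/** Hypothetical number check that skips Roman numeral syntax validation. */
class NumberSketch {

    /**
     * True for digit-only strings and for strings made up solely of the
     * letters I, V, X, L, C, D, M (assumption: upper case only).
     */
    static boolean isNaturalNumber(String stringToBeChecked) {
        if (stringToBeChecked == null || stringToBeChecked.isEmpty()) {
            return false;
        }
        return stringToBeChecked.matches("[0-9]+")
                || stringToBeChecked.matches("[IVXLCDM]+");
    }
}
```

      Skipping the syntax check has a visible consequence in this sketch: malformed numerals such as "IIII" and ordinary upper-case words such as "MIX" are also accepted.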
    • isEnglishNumberWord

      public static boolean isEnglishNumberWord(String stringToBeChecked)
      Checks whether the stringToBeChecked is an ordinal or cardinal number in English in written format. The number must be between 0 and 10 in order to be detected.
      Parameters:
      stringToBeChecked - The string that shall be checked.
      Returns:
      True if the String is an English number word (e.g. 'nine' or 'fifth'), else false.
    • removeNonAlphanumericCharacters

      public static String removeNonAlphanumericCharacters(String stringWithPunctuation)
      Removes everything that is not a digit, letter, space, or underscore. Note: In English, this may lead to a concatenation of the genitive s with the preceding word, e.g. that's → thats. It might make sense to remove those first.
      Parameters:
      stringWithPunctuation - String with punctuation.
      Returns:
      String without punctuation.
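
      The interaction between punctuation removal and the genitive s can be shown with two regular expressions. Both patterns and the class name PunctuationSketch are assumptions for illustration; in particular, only the plain ASCII apostrophe is handled.

```java
/** Hypothetical punctuation and genitive-s removal. */
class PunctuationSketch {

    /** Keeps letters, digits, spaces, and underscores; drops everything else. */
    static String removeNonAlphanumeric(String stringWithPunctuation) {
        return stringWithPunctuation.replaceAll("[^\\p{L}\\p{N} _]", "");
    }

    /** Assumption: the genitive is an "'s" at the end of a word. */
    static String removeGenitiveS(String string) {
        return string.replaceAll("'s\\b", "");
    }
}
```

      Applied directly, removeNonAlphanumeric turns "Hagrid's hut" into "Hagrids hut". Calling removeGenitiveS first yields "Hagrid hut", which is the ordering the note above suggests.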
    • removeEnglishGenitiveS

      public static String removeEnglishGenitiveS(String string)
      Removes the English genitive s.
      Parameters:
      string - String that might contain genitive s.
      Returns:
      Edited String.
    • getCommaSeparatedString

      public static String getCommaSeparatedString(HashSet<String> set)
      Get a comma separated list of the given HashSet<String>.
      Parameters:
      set - The set that shall be represented as comma separated String.
      Returns:
      The elements of the Set in a String separated by a comma.
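
      A comma-separated representation can be built with String.join. The separator without a trailing space and the sorting step are assumptions of this sketch; sorting is added only because HashSet iteration order is unspecified.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical comma-separated rendering of a set of strings. */
class CommaSeparatedSketch {

    /** Joins the elements with "," in sorted order (sorting is this sketch's addition). */
    static String join(Set<String> set) {
        return String.join(",", new TreeSet<>(set));
    }
}
```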