Package de.uni_mannheim.informatik.dws.melt.matching_jena_matchers.external.services.stringOperations
Class StringOperations
java.lang.Object
de.uni_mannheim.informatik.dws.melt.matching_jena_matchers.external.services.stringOperations.StringOperations
A helper class for string operations.
-
Nested Class Summary
Modifier and Type, Class: Description
static enum StringOperations.AbbreviationHandler: Enum which indicates how shortcuts in camel case are handled.
-
Field Summary
-
Constructor Summary
-
Method Summary
Modifier and Type, Method: Description
addAlternativeWritingsSimple(set): Generate alternative writings (particularly interesting for English and German hyphenation).
static String addTagIfNotExists(String addTagString): Adds tags if they are not there yet.
static String cleanStringForDBpediaQuery(String inputString): This method removes illegal characters of a string when used in a SPARQL query.
static String cleanValueFromTypeAnnotation(String valueToClean): Will clean a value from a type annotation.
static String[] clearArrayFromNumbers(String[] array): Given a String array, numeric tokens will be removed.
static String[] clearArrayFromStopwords(String[] arrayWithStopwords): Returns an array cleaned from stopwords.
clearHashSetFromStopwords(HashSet<String> hashSetWithStopwords): Removes the stopwords from the given HashSet.
private static String concatArray(String[] array): Concatenates a string array to one string separated by spaces.
static boolean containsSplitWords(String phrase)
static boolean containsSplitWords(String[] phraseTokens)
static String convertToTag(String stringToConvert): Converts a string to a tag.
static String getCommaSeparatedString(set): Get a comma separated list of the given HashSet<String>.
static float getLevenshteinDistanceSimilarTokensOneWay(String[] sarray1, String[] sarray2): Return the Levenshtein distance between two token sets.
static int getNumberOfTokensBestGuess(String phrase): Returns the number of tokens that were found in a phrase.
static int getNumberOfTokensBestGuess(String phrase, StringOperations.AbbreviationHandler handler): Returns the number of tokens that were found in a phrase.
static boolean hasSimilarTokenWriting(String[] sarray1, String[] sarray2, float tolerance): Checks whether two arrays have a similar writing.
static void initStopwords(): Initialize reading stopwords.
static boolean isCamelCase(String phrase): Function which indicates whether a phrase is in camel case or not.
static boolean isEnglishNumberWord(String stringToBeChecked): Checks whether the stringToBeChecked is an ordinal or cardinal number in English in written format.
static boolean isMeaningfulFragment(String fragment): Checks whether a fragment is meaningful by counting the number of digits.
static boolean isNaturalNumber(String stringToBeChecked): Returns whether the stringToBeChecked is a number, e.g. '123' or 'XI'.
static boolean isSameString(String s1, String s2): This method checks whether two Strings are very similar by performing simple string operations.
static boolean isSameStringIgnoringStopwords(s1, s2): This method checks whether two Strings are very similar by performing simple string operations.
static boolean isSameStringIgnoringStopwordsAndNumbers(s1, s2): This method checks whether two Strings are very similar by performing simple string operations.
static boolean isSameStringIgnoringStopwordsAndNumbersWithSpellingCorrection(String s1, String s2, float maxAllowedEditDistance)
static boolean isSameStringStemming(String s1, String s2): This method checks whether two Strings are very similar by performing simple string operations including Porter's stemmer.
static boolean isSpaceCase(String phrase): Function which indicates whether a phrase is space separated or not.
static boolean isUnderscoreCase(String phrase): Function which indicates whether a phrase is in underscore case or not.
private static void lazyInitStopwords(): Initialize reading stopwords file if it has not been read before.
static void printStringArray(String[] stringArray): A method which prints the content of a string array to the command line.
readListFromFile(File file): Reads a List from the file as specified by the file.
readListFromFile(String filePath): Reads a List from the file as specified by the file path.
readSetFromFile(File file): Reads a Set from the file as specified by the file.
readSetFromFile(String filePath): Reads a Set from the file as specified by the file path.
static String reduceToLettersOnly(String string): Cleans a string from anything that is not a letter.
static String removeEnglishGenitiveS(String string): Removes the English genitive s.
static String[] removeEnglishGenitiveS(String[] array): Removes free floating "s", "S", and cuts "'s".
removeEnglishGenitiveS(set): Remove free floating s from the given set.
static String removeEnglishPlural(String stringToBeModified): Remove the plural in English words.
static String removeLanguageAnnotation(s): Removes the language annotation from a string.
static String removeNonAlphanumericCharacters(String stringWithPunctuation): Removes everything that is not a digit, character, space, or underscore.
removeNumbers(HashSet<String> set): Remove numbers from a set of strings.
static String removeTag(tagToConvert): Removes the angle brackets of a tag.
static String[] splitUsingSplitWords(String[] phraseTokens)
static String stemPorter(String word): Wrapping of Porter's Stemming Code.
static String[] tokenizeBestGuess(String phrase): Given an arbitrary phrase, the method determines which casing is used and applies the suited tokenizer.
static String[] tokenizeBestGuess(String phrase, StringOperations.AbbreviationHandler handler): Given an arbitrary phrase, the method determines which casing is used and applies the suited tokenizer.
static String[] tokenizeCamelCase(String phrase, StringOperations.AbbreviationHandler handler): Given a camel cased String, this method will split it into multiple tokens.
static String[] tokenizeCamelCaseAndSlash(String phrase, StringOperations.AbbreviationHandler handler): Tokenize and use camelCase and slashes as tokenization tokens.
static String[] tokenizeSpaceCase(String phrase): Tokenizes phrase using spaces.
static String[] tokenizeUnderScoreCase(String phrase): Tokenizes phrase using underscores.
private static String[] tokenizeWithoutCamelCaseRecognition(phrase): Split using slash, underscore and space.
static <T> void writeSetToFile(File fileToWrite, Set<T> setToWrite): This method writes the content of a Set<String> to a file.
-
Field Details
-
LOGGER
private static final org.slf4j.Logger LOGGER
-
separatingWords
-
stopwords
-
PATH_TO_STOPWORD_FILE
-
PATH_TO_STOPWORD_FILE_JAR
-
ENGLISH_NUMBER_WORDS_SET
A set containing ordinal and cardinal number words from 1 to 10.
-
-
Constructor Details
-
StringOperations
public StringOperations()
-
-
Method Details
-
isCamelCase
Function which indicates whether a phrase is in camel case or not.- Parameters:
phrase
- The phrase to be checked.- Returns:
- true if phrase is in camel case, else false.
-
isUnderscoreCase
Function which indicates whether a phrase is in underscore case or not.- Parameters:
phrase
- The phrase to be checked.- Returns:
- True if phrase is in underscore case, else false.
-
isSpaceCase
Function which indicates whether a phrase is space separated or not.- Parameters:
phrase
- The phrase to be checked.- Returns:
- True if space-separated, else false.
-
tokenizeCamelCase
public static String[] tokenizeCamelCase(String phrase, StringOperations.AbbreviationHandler handler) Given a camel cased String, this method will split it into multiple tokens.- Parameters:
phrase
- The phrase to be tokenized.handler
- Determines how to handle abbreviations.- Returns:
- The tokens of the phrase.
-
tokenizeSpaceCase
Tokenizes phrase using spaces.- Parameters:
phrase
- The phrase to be tokenized.- Returns:
- The tokens of the phrase.
-
tokenizeUnderScoreCase
Tokenizes phrase using underscores.- Parameters:
phrase
- The phrase to be tokenized.- Returns:
- The tokens of the phrase.
-
printStringArray
A method which prints the content of a string array to the command line.- Parameters:
stringArray
- Array to be printed.
-
tokenizeBestGuess
public static String[] tokenizeBestGuess(String phrase, StringOperations.AbbreviationHandler handler) Given an arbitrary phrase, the method determines which casing is used and applies the suited tokenizer. The tokenizer is not very aggressive. A '-' for instance, will not be used as splitter. For camel cased phrases with abbreviations, all combinations are determined if no handler is defined.- Parameters:
phrase
- The phrase to be tokenized.handler
- The handler which determines how abbreviations shall be handled.- Returns:
- Tokens.
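A minimal sketch of the best-guess dispatch described above, written as a stand-alone class rather than against the MELT API. The class name `TokenizerSketch` and the detection rules (a space means space case, an underscore means underscore case, everything else is treated as camel case) are assumptions for illustration. Abbreviation handling via `AbbreviationHandler` is deliberately left out.

```java
import java.util.Arrays;

final class TokenizerSketch {

    // Assumption: any space makes the phrase space separated.
    static boolean isSpaceCase(String phrase) {
        return phrase.contains(" ");
    }

    // Assumption: underscores without spaces indicate underscore case.
    static boolean isUnderscoreCase(String phrase) {
        return !phrase.contains(" ") && phrase.contains("_");
    }

    // Splits between a lower-case letter and the upper-case letter that follows it.
    static String[] tokenizeCamelCase(String phrase) {
        return phrase.split("(?<=[a-z])(?=[A-Z])");
    }

    // Picks the tokenizer by casing; a '-' is never used as a splitter.
    static String[] tokenizeBestGuess(String phrase) {
        String trimmed = phrase.trim();
        if (isSpaceCase(trimmed)) {
            return trimmed.split(" +");
        }
        if (isUnderscoreCase(trimmed)) {
            return trimmed.split("_+");
        }
        return tokenizeCamelCase(trimmed);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenizeBestGuess("hasSimilarTokenWriting")));
        System.out.println(Arrays.toString(tokenizeBestGuess("red-wine glass")));
    }
}
```

In this sketch "red-wine glass" yields the two tokens "red-wine" and "glass", which mirrors the note that a hyphen is not treated as a splitter.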
-
tokenizeWithoutCamelCaseRecognition
Split using slash, underscore and space.- Parameters:
phrase
- Phrase to be split.- Returns:
- Array of individual tokens.
-
tokenizeCamelCaseAndSlash
public static String[] tokenizeCamelCaseAndSlash(String phrase, StringOperations.AbbreviationHandler handler) Tokenize and use camelCase and slashes as tokenization tokens.- Parameters:
phrase
- The phrase to be tokenized.handler
- Abbreviation handler.- Returns:
- String array of tokens.
-
tokenizeBestGuess
Given an arbitrary phrase, the method determines which casing is used and applies the suited tokenizer. For camel cased phrases with abbreviations, it is assumed that an upper case follows an abbreviation.- Parameters:
phrase
- The phrase to be tokenized.- Returns:
- Tokens.
-
getNumberOfTokensBestGuess
public static int getNumberOfTokensBestGuess(String phrase, StringOperations.AbbreviationHandler handler) Returns the number of tokens that were found in a phrase.- Parameters:
phrase
- The phrase to be checked.handler
- defines the handling of abbreviations. Note that AbbreviationHandler.CONSIDER_ALL leads to more tokens than actually exist because combinations are employed.- Returns:
- Number of tokens.
-
getNumberOfTokensBestGuess
Returns the number of tokens that were found in a phrase. Note that the number of tokens is obtained using AbbreviationHandler.UPPER_CASE_FOLLOWS_ABBREVIATION in the default case. Note further that stopword removal is not taken into account. Be careful when mixing with stopword removal.- Parameters:
phrase
- The phrase that shall be checked.- Returns:
- The number of tokens.
-
containsSplitWords
- Parameters:
phrase
- The phrase to be checked.- Returns:
- True if the phrase contains split words.
-
containsSplitWords
- Parameters:
phraseTokens
- The tokens that shall be processed.- Returns:
- True if the tokens contain split words, else false.
-
splitUsingSplitWords
-
concatArray
Concatenates a string array to one string separated by spaces.- Parameters:
array
- Array that shall be concatenated.- Returns:
- Concatenated array as String.
-
cleanStringForDBpediaQuery
This method removes illegal characters of a string when used in a SPARQL query.- Parameters:
inputString
- Input String.- Returns:
- Edited String.
-
reduceToLettersOnly
Cleans a string from anything that is not a letter.- Parameters:
string
- String to be cleaned.- Returns:
- Cleaned String.
-
writeSetToFile
This method writes the content of a Set<String> to a file. The file will be UTF-8 encoded.- Type Parameters:
T
- Type of the Set.- Parameters:
fileToWrite
- File which will be created and in which the data will be written.setToWrite
- Set whose content will be written into fileToWrite.
-
readSetFromFile
Reads a Set from the file as specified by the file path.- Parameters:
filePath
- The path to the file that is to be read.- Returns:
- The parsed file as Set.
-
readSetFromFile
Reads a Set from the file as specified by the file.- Parameters:
file
- The file that is to be read.- Returns:
- The parsed file as Set.
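The write and read methods above can be pictured as a UTF-8 round trip. The following is a sketch under stated assumptions, not the library's implementation: the class name `SetFileSketch` is hypothetical, each element is assumed to be written on its own line via `toString()`, and I/O errors are rethrown unchecked for brevity.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SetFileSketch {

    // Assumption: one element per line, rendered with String.valueOf.
    static <T> void writeSetToFile(File fileToWrite, Set<T> setToWrite) {
        List<String> lines = new ArrayList<>();
        for (T element : setToWrite) {
            lines.add(String.valueOf(element));
        }
        try {
            Files.write(fileToWrite.toPath(), lines, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the file line by line; duplicates collapse because the result is a Set.
    static Set<String> readSetFromFile(File file) {
        try {
            return new LinkedHashSet<>(Files.readAllLines(file.toPath(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("set-sketch", ".txt");
        tmp.deleteOnExit();
        Set<String> original = new LinkedHashSet<>(Arrays.asList("Hagrid", "Gr\u00fc\u00dfe"));
        writeSetToFile(tmp, original);
        System.out.println(readSetFromFile(tmp).equals(original));
    }
}
```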
-
readListFromFile
Reads a List from the file as specified by the file path.- Parameters:
filePath
- The path to the file that is to be read.- Returns:
- The parsed file as List.
-
readListFromFile
Reads a List from the file as specified by the file.- Parameters:
file
- The file that is to be read.- Returns:
- The parsed file as List.
-
convertToTag
Converts a string to a tag. Example: "Hagrid" will be converted to "<Hagrid>". If the string is already a tag, the string will be returned as it is.- Parameters:
stringToConvert
- The String which shall be converted to a tag.- Returns:
- The String as tag.
-
removeTag
Removes the angle brackets of a tag. Example: "<Hagrid>" will be converted to "Hagrid".- Parameters:
tagToConvert
- The tag which shall be converted.- Returns:
- The string as non-tag.
-
addTagIfNotExists
Adds tags if they are not there yet. "<Hagrid>" will be converted to "<Hagrid>", "Hagrid" will be converted to "<Hagrid>", "<Hagrid" will be converted to "<Hagrid>" etc.- Parameters:
addTagString
- String to which tags shall be added.- Returns:
- Tagged string.
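The tag examples above can be reproduced with a few lines of plain Java. This is a minimal sketch, assuming that a "tag" simply means a string wrapped in angle brackets; the class name `TagSketch` is hypothetical and the code is not taken from the library.

```java
final class TagSketch {

    // Strips a leading '<' and a trailing '>' if present.
    static String removeTag(String tagToConvert) {
        String result = tagToConvert;
        if (result.startsWith("<")) {
            result = result.substring(1);
        }
        if (result.endsWith(">")) {
            result = result.substring(0, result.length() - 1);
        }
        return result;
    }

    // Adds whichever bracket is missing; complete tags are returned unchanged.
    static String addTagIfNotExists(String addTagString) {
        String result = addTagString;
        if (!result.startsWith("<")) {
            result = "<" + result;
        }
        if (!result.endsWith(">")) {
            result = result + ">";
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(addTagIfNotExists("Hagrid"));
        System.out.println(addTagIfNotExists("<Hagrid"));
        System.out.println(removeTag("<Hagrid>"));
    }
}
```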
-
removeEnglishPlural
Remove the plural in English words.- Parameters:
stringToBeModified
- The string that shall be modified.- Returns:
- Modified string.
-
removeLanguageAnnotation
Removes the language annotation from a string. If the string does not have a language annotation, the string will be returned unchanged. Example: "Hagrid@en" will be changed to "Hagrid".- Parameters:
s
- String to be changed.- Returns:
- String without language annotation.
-
cleanValueFromTypeAnnotation
Will clean a value from a type annotation. Example: "0.816318^^http://www.w3.org/2001/XMLSchema#float" will be cleaned to "0.816318".- Parameters:
valueToClean
- The value that shall be cleaned.- Returns:
- The cleaned value as String.
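Both annotation cleaners above amount to cutting a suffix off a literal. The sketch below illustrates the idea under two assumptions that are not confirmed by this page: a language annotation is a trailing "@" followed by a language tag such as "en" or "de-AT", and a type annotation is everything from the first "^^" onwards. The class name `AnnotationSketch` is hypothetical.

```java
final class AnnotationSketch {

    // Assumption: the language tag consists of letters, optionally followed by
    // hyphen-separated subtags, and sits at the very end of the string.
    static String removeLanguageAnnotation(String s) {
        return s.replaceFirst("@[A-Za-z]+(-[A-Za-z0-9]+)*$", "");
    }

    // Assumption: the value ends where the first "^^" starts.
    static String cleanValueFromTypeAnnotation(String valueToClean) {
        int index = valueToClean.indexOf("^^");
        return index < 0 ? valueToClean : valueToClean.substring(0, index);
    }

    public static void main(String[] args) {
        System.out.println(removeLanguageAnnotation("Hagrid@en"));
        System.out.println(cleanValueFromTypeAnnotation(
                "0.816318^^http://www.w3.org/2001/XMLSchema#float"));
    }
}
```

Strings without an annotation pass through unchanged in both helpers.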
-
isSameStringStemming
This method checks whether two Strings are very similar by performing simple string operations including Porter's stemmer.- Parameters:
s1
- String 1.s2
- String 2.- Returns:
- boolean
-
isSameString
This method checks whether two Strings are very similar by performing simple string operations. Stopwords are retained.- Parameters:
s1
- String 1s2
- String 2- Returns:
- boolean
-
isSameStringIgnoringStopwordsAndNumbersWithSpellingCorrection
-
hasSimilarTokenWriting
Checks whether two arrays have a similar writing. Every token is matched to its most similar token. Tokens can be used multiple times.- Parameters:
sarray1
- Array 1sarray2
- Array 2tolerance
- The minimal tolerance that is allowed.- Returns:
- True if the distance is less than or equal to the allowed tolerance.
-
getLevenshteinDistanceSimilarTokensOneWay
Return the Levenshtein distance between two token sets. This is only a one-way test: if sarray2 contains all tokens of sarray1, then the distance will be 0 even though sarray2 might contain additional tokens that are not contained in sarray1. Tokens can be used multiple times.- Parameters:
sarray1
- Array 1sarray2
- Array 2- Returns:
- Distance as float.
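To make the one-way property concrete, here is a sketch of one possible reading: every token of the first array is matched to its closest token in the second array, and the per-token edit distances are summed. The summation, the handling of an empty second array, and the class name `OneWayDistanceSketch` are assumptions; the actual method may aggregate or normalize differently.

```java
final class OneWayDistanceSketch {

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumption: per-token minima are summed. Tokens of sarray2 may be reused.
    static float oneWayDistance(String[] sarray1, String[] sarray2) {
        float total = 0f;
        for (String token : sarray1) {
            int best = token.length(); // used only if sarray2 is empty
            boolean first = true;
            for (String candidate : sarray2) {
                int d = levenshtein(token, candidate);
                if (first || d < best) {
                    best = d;
                    first = false;
                }
            }
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        String[] small = {"red", "wine"};
        String[] large = {"wine", "red", "glass"};
        System.out.println(oneWayDistance(small, large));
        System.out.println(oneWayDistance(large, small));
    }
}
```

Swapping the arguments changes the result, because the extra token "glass" only counts when it is in the first array.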
-
isSameStringIgnoringStopwords
This method checks whether two Strings are very similar by performing simple string operations. Stopwords are removed.- Parameters:
s1
- String 1s2
- String 2- Returns:
- boolean
-
isSameStringIgnoringStopwordsAndNumbers
This method checks whether two Strings are very similar by performing simple string operations. Stopwords and numbers are removed.- Parameters:
s1
- String 1s2
- String 2- Returns:
- boolean
-
clearArrayFromStopwords
Returns an array cleaned from stopwords. Retains the ordering.- Parameters:
arrayWithStopwords
- Array with stopwords.- Returns:
- Array without stopwords.
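An order-preserving stopword filter can be sketched as follows. The tiny `STOPWORDS` set is a hypothetical stand-in for the list that the class reads from its stopword file, and the case-insensitive comparison is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopwordSketch {

    // Hypothetical stand-in for the stopword list loaded from a file.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("the", "of", "a", "and"));

    // Keeps non-stopword tokens in their original order.
    static String[] clearArrayFromStopwords(String[] arrayWithStopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : arrayWithStopwords) {
            // Assumption: stopwords are matched case-insensitively.
            if (!STOPWORDS.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "Chamber", "of", "Secrets"};
        System.out.println(Arrays.toString(clearArrayFromStopwords(tokens)));
    }
}
```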
-
clearHashSetFromStopwords
Removes the stopwords from the given HashSet.- Parameters:
hashSetWithStopwords
- HashSet from which the stopwords shall be removed.- Returns:
- Cleared HashSet
-
removeEnglishGenitiveS
Removes free floating "s", "S", and cuts "'s".- Parameters:
array
- Array to be transformed.- Returns:
- New array.
-
removeEnglishGenitiveS
Remove free floating s from the given set.- Parameters:
set
- Set from which s shall be removed.- Returns:
- Set with removed s/S.
-
stemPorter
Wrapping of Porter's Stemming Code.- Parameters:
word
- Word to be stemmed.- Returns:
- Stemmed word.
-
lazyInitStopwords
private static void lazyInitStopwords() Initialize reading stopwords file if it has not been read before.
-
initStopwords
public static void initStopwords() Initialize reading stopwords.
-
isMeaningfulFragment
Checks whether a fragment is meaningful by counting the number of digits.- Parameters:
fragment
- The fragment for which relevance shall be checked.- Returns:
- Returns false if at least half of the fragment is composed of digits.
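The digit-counting rule stated above is small enough to sketch directly. The class name `FragmentSketch` is hypothetical, and treating an empty or null fragment as not meaningful is an assumption.

```java
final class FragmentSketch {

    // Returns false if at least half of the characters are digits.
    static boolean isMeaningfulFragment(String fragment) {
        if (fragment == null || fragment.isEmpty()) {
            return false; // assumption: nothing to judge, so not meaningful
        }
        long digits = fragment.chars().filter(Character::isDigit).count();
        return digits * 2 < fragment.length();
    }

    public static void main(String[] args) {
        System.out.println(isMeaningfulFragment("abc1")); // one digit out of four
        System.out.println(isMeaningfulFragment("ab12")); // exactly half digits
    }
}
```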
-
addAlternativeWritingsSimple
Generate alternative writings (particularly interesting for English and German hyphenation).- Parameters:
set
- The set which shall be processed.- Returns:
- The new set with alternative writings.
-
removeNumbers
Remove numbers from a set of strings.- Parameters:
set
- Set from which numbers shall be removed.- Returns:
- A new set with no number instances.
-
clearArrayFromNumbers
Given a String array, numeric tokens will be removed.- Parameters:
array
- The array from which numeric components shall be removed.- Returns:
- The new array will be of smaller length while the order of tokens will be retained.
-
isNaturalNumber
Returns whether the stringToBeChecked is a number, e.g. '123' or 'XI'. For reasons of performance, the syntax of Roman numerals is not checked.- Parameters:
stringToBeChecked
- The string that shall be checked for numeric properties.- Returns:
- True if roman or arabic number, else false.
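The number check and the array filter described above can be sketched together. This is an illustration under assumptions: Roman numerals are recognized only by their upper-case letters, and the array filter is assumed to reuse the same check. The class name `NumberSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NumberSketch {

    // Arabic digits only, or Roman numeral letters only.
    // As in the description, the Roman numeral syntax is not validated.
    static boolean isNaturalNumber(String stringToBeChecked) {
        if (stringToBeChecked == null || stringToBeChecked.isEmpty()) {
            return false;
        }
        return stringToBeChecked.matches("[0-9]+")
                || stringToBeChecked.matches("[IVXLCDM]+"); // assumption: upper case only
    }

    // Drops numeric tokens; the remaining tokens keep their order.
    static String[] clearArrayFromNumbers(String[] array) {
        List<String> kept = new ArrayList<>();
        for (String token : array) {
            if (!isNaturalNumber(token)) {
                kept.add(token);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] tokens = {"Chapter", "XI", "page", "123"};
        System.out.println(Arrays.toString(clearArrayFromNumbers(tokens)));
    }
}
```

Because the syntax is not validated, an invalid numeral such as "IIII" and an ordinary upper-case word such as "MIX" both pass the check in this sketch.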
-
isEnglishNumberWord
Checks whether the stringToBeChecked is an ordinal or cardinal number in English in written format. The number must be between 0 and 10 in order to be detected.- Parameters:
stringToBeChecked
- The string that shall be checked.- Returns:
- True if the String is an English number word (e.g. 'nine' or 'fifth'), else false.
-
removeNonAlphanumericCharacters
Removes everything that is not a digit, character, space, or underscore. Note: In English, this may lead to a concatenation of the genitive s with the preceding word, e.g. that's → thats. It might make sense to remove the genitive s first.- Parameters:
stringWithPunctuation
- String with punctuation.- Returns:
- String without punctuation.
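The effect mentioned in the note, and the suggested order of operations, can be illustrated with two regular expressions. This sketch assumes that "character" means an ASCII letter; `removeGenitiveS` is a hypothetical helper that cuts "'s" at a word end and is not the library's removeEnglishGenitiveS.

```java
final class PunctuationSketch {

    // Keeps ASCII letters, digits, spaces, and underscores (assumption: ASCII only).
    static String removeNonAlphanumericCharacters(String stringWithPunctuation) {
        return stringWithPunctuation.replaceAll("[^A-Za-z0-9 _]", "");
    }

    // Hypothetical helper: removes "'s" when it ends a word.
    static String removeGenitiveS(String string) {
        return string.replaceAll("'s\\b", "");
    }

    public static void main(String[] args) {
        // Punctuation removal alone glues the genitive s to the word.
        System.out.println(removeNonAlphanumericCharacters("Hagrid's hut!"));
        // Removing the genitive s first avoids that.
        System.out.println(removeNonAlphanumericCharacters(removeGenitiveS("Hagrid's hut!")));
    }
}
```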
-
removeEnglishGenitiveS
Removes the English genitive s.- Parameters:
string
- String that might contain genitive s.- Returns:
- Edited String.
-
getCommaSeparatedString
Get a comma separated list of the given HashSet<String>.- Parameters:
set
- The set that shall be represented as comma separated String.- Returns:
- The elements of the Set in a String separated by a comma.
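Joining a set into one comma-separated string is a one-liner with the standard library. The sketch below assumes a bare comma without surrounding spaces as separator, which this page does not specify; the class name `CommaSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;

final class CommaSketch {

    // Assumption: elements are joined with "," and no extra whitespace.
    static String getCommaSeparatedString(HashSet<String> set) {
        return String.join(",", set);
    }

    public static void main(String[] args) {
        // LinkedHashSet is a HashSet with predictable iteration order.
        HashSet<String> names = new LinkedHashSet<>(Arrays.asList("Hagrid", "Dobby"));
        System.out.println(getCommaSeparatedString(names));
    }
}
```

With a plain `HashSet` the iteration order is unspecified, so the order of elements in the resulting string should not be relied upon.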
-