public class JaccardCoeficientsFile extends DuplicityCheckingFile
AllSimilarUnitPairsFile
.
Just like the "similar unit pairs" files, this file contains only pairs where a < b.
The file contains instances of the JaccardCoeficient class.
That is, it contains triples {first, second, num}, where first and second are identifiers of the units
being checked for duplicity (a unit can be a document, a paragraph or a sentence)
and num is the number of occurrences of the pair in the underlying AllSimilarUnitPairsFile, which is the
number of permutations according to which the two units are similar.
The file is sorted: the main criterion is the first field and, in case of a tie, the second field.
The file can be created from CommonSimilarUnitPairsFile instances
using AllSimilarUnitPairsFile.mergeToJaccardCoeficientsFile(java.util.ArrayList<org.egothor.duplicity.file.CommonSimilarUnitPairsFile>)
or read from the filesystem using the constructor JaccardCoeficientsFile(String location).
Files can be merged using the merge(org.egothor.duplicity.file.JaccardCoeficientsFile, org.egothor.duplicity.file.JaccardCoeficientsFile)
method.

Constructor and Description |
---|
JaccardCoeficientsFile(String location) |
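
The triple layout described in the class description can be illustrated with a small in-memory sketch. It is not the library's implementation: the `PairCount` and `PairCountSketch` classes, the use of `long` identifiers, and the `estimate` helper are all hypothetical, and the real class works on files rather than lists. The sketch counts how many permutations report each pair, normalises every pair so that first < second, and emits the triples sorted by first and then second. The `estimate` method uses the usual min-wise hashing ratio num / permutations, which is an assumption about how num relates to a Jaccard coefficient here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for one {first, second, num} entry.
// It is not the library's JaccardCoeficient class.
final class PairCount {
    final long first;
    final long second;
    final int num;

    PairCount(long first, long second, int num) {
        this.first = first;
        this.second = second;
        this.num = num;
    }

    @Override
    public String toString() {
        return "{" + first + ", " + second + ", " + num + "}";
    }
}

class PairCountSketch {

    // Each inner list holds the pairs reported as similar under one permutation,
    // given as two-element arrays {a, b} in arbitrary order.
    static List<PairCount> countPairs(List<List<long[]>> pairsPerPermutation) {
        // Outer key = first, inner key = second; TreeMap keeps both levels sorted.
        TreeMap<Long, TreeMap<Long, Integer>> counts = new TreeMap<>();
        for (List<long[]> pairs : pairsPerPermutation) {
            for (long[] p : pairs) {
                long a = Math.min(p[0], p[1]);
                long b = Math.max(p[0], p[1]);
                if (a == b) {
                    continue; // a unit is never paired with itself
                }
                counts.computeIfAbsent(a, k -> new TreeMap<>()).merge(b, 1, Integer::sum);
            }
        }
        List<PairCount> result = new ArrayList<>();
        for (Map.Entry<Long, TreeMap<Long, Integer>> outer : counts.entrySet()) {
            for (Map.Entry<Long, Integer> inner : outer.getValue().entrySet()) {
                result.add(new PairCount(outer.getKey(), inner.getKey(), inner.getValue()));
            }
        }
        return result;
    }

    // Assumption: the usual min-wise hashing estimate, num / number of permutations.
    static double estimate(int num, int permutations) {
        return (double) num / permutations;
    }

    public static void main(String[] args) {
        List<List<long[]>> perms = List.of(
                List.of(new long[] {2, 1}, new long[] {3, 5}),
                List.of(new long[] {1, 2}),
                List.of(new long[] {1, 2}, new long[] {1, 4}));
        for (PairCount pc : countPairs(perms)) {
            System.out.println(pc + " estimate=" + estimate(pc.num, perms.size()));
        }
    }
}
```

A nested `TreeMap` is used only because it yields the (first, second) ordering without a separate sort step; a file-based implementation would more likely sort or merge on disk.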

Modifier and Type | Method and Description |
---|---|
String | dump() - Dumps the file with its content to String. |
Map<TextUnitID,JaccardCoeficient> | filterRelevantForDocument(DocumentUnitID doc) - Filters from the file only the entries relevant for the given document. |
String | getFilename() - Returns the filename corresponding to this file. |
boolean | hasTheSameContent(DuplicityCheckingFile file) - Checks if two files have the same content. |
Map<DocumentUnitID,Double> | markDuplicates(List<DocumentData> docs) |
void | merge(JaccardCoeficientsFile jcf1, JaccardCoeficientsFile jcf2) - Merges files externally on filesystem. |
void | remove(Set<DocumentUnitID> toRemove) - Removes all occurrences of documents given in the set from the file. |

Methods inherited from class DuplicityCheckingFile: delete, getLocation, getOut, hasTheSameContent, toString
public JaccardCoeficientsFile(String location) throws IOException, DuplicityCheckingException
Parameters:
location -
Throws:
IOException
DuplicityCheckingException
public void merge(JaccardCoeficientsFile jcf1, JaccardCoeficientsFile jcf2) throws IOException
Merges files externally on filesystem.
Parameters:
jcf1 - a file to be merged into this file
jcf2 - a file to be merged into this file
Throws:
IOException - if a temporary file could not be created
See Also:
mergeAll(org.egothor.duplicity.file.JaccardCoeficientsFile, java.util.ArrayList<org.egothor.duplicity.file.JaccardCoeficientsFile>)
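
Because the file is sorted by first and then second, merging two such files can be done as a single two-way pass. The sketch below shows that idea on in-memory lists. It is only an illustration under stated assumptions: `CountEntry` and `SortedMergeSketch` are hypothetical names, the real method streams through files on the filesystem, and summing num when both inputs contain the same pair is a guess about the intended behaviour, not something this page confirms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical entry type used only in this sketch.
final class CountEntry {
    final long first;
    final long second;
    final int num;

    CountEntry(long first, long second, int num) {
        this.first = first;
        this.second = second;
        this.num = num;
    }

    // Orders by first, then by second, matching the file's sort order.
    int compareKey(CountEntry other) {
        int c = Long.compare(first, other.first);
        return c != 0 ? c : Long.compare(second, other.second);
    }
}

class SortedMergeSketch {

    // Both inputs must already be sorted by (first, second).
    static List<CountEntry> merge(List<CountEntry> a, List<CountEntry> b) {
        List<CountEntry> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            CountEntry x = a.get(i);
            CountEntry y = b.get(j);
            int c = x.compareKey(y);
            if (c < 0) {
                out.add(x);
                i++;
            } else if (c > 0) {
                out.add(y);
                j++;
            } else {
                // Assumption: counts for the same pair are added together.
                out.add(new CountEntry(x.first, x.second, x.num + y.num));
                i++;
                j++;
            }
        }
        // Copy whatever remains in either input.
        while (i < a.size()) {
            out.add(a.get(i++));
        }
        while (j < b.size()) {
            out.add(b.get(j++));
        }
        return out;
    }
}
```

The output stays sorted, so the result can itself be fed into further merges.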
public Map<DocumentUnitID,Double> markDuplicates(List<DocumentData> docs) throws IOException, DuplicityCheckingException
Parameters:
docs -
Throws:
IOException
DuplicityCheckingException
public String getFilename()
Returns the filename corresponding to this file (see Constants.JACCARD_COEFICIENTS_FILE_NAME).
getFilename in class DuplicityCheckingFile
public String dump()
public void remove(Set<DocumentUnitID> toRemove) throws IOException
Removes all occurrences of documents given in the set from the file.
Parameters:
toRemove - set of document ids to remove
Throws:
IOException
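
Removal can be pictured as a single filtering pass over the sorted triples. The sketch below is hypothetical throughout: `RemoveSketch`, the `long[] {first, second, num}` representation, and the `docOf` function that maps a unit identifier to its document identifier are all assumptions made for illustration, and the real method rewrites a file rather than a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.LongUnaryOperator;

class RemoveSketch {

    // entries: triples {first, second, num}, sorted by first and then second.
    // docOf: hypothetical mapping from a unit identifier to its document identifier.
    static List<long[]> remove(List<long[]> entries, Set<Long> toRemove, LongUnaryOperator docOf) {
        List<long[]> kept = new ArrayList<>();
        for (long[] e : entries) {
            boolean touchesRemoved = toRemove.contains(docOf.applyAsLong(e[0]))
                    || toRemove.contains(docOf.applyAsLong(e[1]));
            if (!touchesRemoved) {
                kept.add(e); // order is preserved, so the result stays sorted
            }
        }
        return kept;
    }
}
```

An entry is dropped when either of its units belongs to a removed document, since a pair with only one surviving unit carries no usable similarity information.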
public Map<TextUnitID,JaccardCoeficient> filterRelevantForDocument(DocumentUnitID doc) throws IOException
Filters from the file only the entries relevant for the given document (see Constants.SIMILARITY_RELEVANT_TRESHOLD).
Parameters:
doc - document for which the relevant entries are requested
Throws:
IOException - on error while reading the file from the filesystem
public boolean hasTheSameContent(DuplicityCheckingFile file)
Checks if two files have the same content.
Parameters:
file - the second file to be tested
Copyright © 2016 Egothor. All Rights Reserved.