This chapter describes how to prepare and build an egothor2-based project. Egothor2 is developed in a Maven environment, but you can also use it in a classic Ant project.
Download the dependencies needed to compile and build egothor2. To parse HTML, download the NekoHTML parser as well.
No other libraries are required for the basic application developed in this quick programming guide.
Include the following fragment in your project's POM file.
    <repositories>
      :
      :
      <repository>
        <id>egothor.org</id>
        <name>Egothor.org repository</name>
        <url>http://www.egothor.org/download/release/</url>
        <releases>
          <enabled>true</enabled>
        </releases>
        <snapshots>
          <enabled>false</enabled>
        </snapshots>
        <layout>default</layout>
      </repository>
      <repository>
        <id>egothor.org.snapshot</id>
        <name>Egothor.org snapshot repository</name>
        <url>http://www.egothor.org/download/snapshot/</url>
        <releases>
          <enabled>false</enabled>
        </releases>
        <snapshots>
          <enabled>true</enabled>
        </snapshots>
        <layout>default</layout>
      </repository>
      :
      :
    </repositories>
    <dependencies>
      :
      :
      <dependency>
        <groupId>egothor</groupId>
        <artifactId>egothor2</artifactId>
        <version>[4.0,5.0)</version>
      </dependency>
      :
      :
    </dependencies>
You can use egothor2 as an information retrieval (IR) library for your application. Egothor2 provides:
This section shows the basics of these capabilities.
The index is often stored on disk. In this guide, we use the directory given by the location variable.
    repo = new MapDbRepository(location);
    TankerImplSecure tankerOrig = new TankerImplSecure();
    tankerOrig.initializeTankerSecure(location, repo, true, false, 32, 10, null, 10000);
    tanker = new BufferedTanker(tankerOrig);
This fragment allocates a MapDbRepository (repo) to store raw document bodies; other implementations are also available. The index is managed by an ACID-capable object (tankerOrig). For practical reasons, we use a buffered instance (tanker) to speed up indexing. The buffered instance indexes input data in batches; the search phase is neither modified nor improved.
Now we are ready to index our first item. Anything that can be transformed into a stream of tokens can be indexed. Egothor2 offers basic parsers and transformers to make the job easier, and you can always add parsers for your specific data. The following fragment reads HTML:
    Document document = new Document();
    DocumentData meta = new DocumentData(getUID(datastreamlocation));
    Reader dataReader = new java.io.InputStreamReader( ..., fileCharset );
    HTMLField root = new HTMLField(
            false /* extract hrefs */,
            false /* extract img srcs */,
            true  /* lowercase */,
            false /* no phonetics alg */,
            true  /* remove diacritics */);
    root.initialize(dataReader,
            datastreamlocation /* base URI for resolving href and img src URIs */,
            meta, null);
    document.initialize(meta, root);
    meta.setData("text/html", "data description", fileCharset);
    meta.setDateTime(...);
The item must specify a stream of tokens and the item metadata. The stream is represented by the root object, which reads and parses the HTML stream. You can enable basic transformers, e.g. case conversion or diacritics removal. The metadata includes various information, e.g. a data timestamp, a description (to be displayed in a hit list), the content type, etc.
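To make the two transformers mentioned above concrete, here is a minimal sketch of lowercasing and diacritics removal applied to a token stream. It uses only the standard Java library; the class TokenNormalizerSketch and its methods are hypothetical illustrations and are not part of Egothor2, whose own transformers may behave differently.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative stand-in only; not an Egothor2 class. */
final class TokenNormalizerSketch {

    /** Strips combining diacritical marks from a token and lowercases it. */
    static String normalize(String token) {
        // NFD decomposition splits an accented letter into base letter + combining mark.
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        // \p{M} matches the combining marks produced by the decomposition.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }

    /** Splits text on runs of characters that are not letters, digits, or marks. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}\\p{M}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(normalize(raw));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only: "Café Dvořák, RÉSUMÉ!"
        System.out.println(tokenize("Caf\u00e9 Dvo\u0159\u00e1k, R\u00c9SUM\u00c9!"));
    }
}
```

Applying the same normalization to both indexed text and query terms is what lets a query without accents match accented text.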
The stream and metadata initialize a document object. Now, all we need to do is:
tanker.append(document);
This method may be invoked many times. Unless you commit the tanker instance, nothing will be visible to other processing threads or other JVM processes. Thus, the transaction must be confirmed by:
tanker.commit();
Do not call commit after every single append operation; doing so is inefficient. Commit in batches whenever possible.
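The batching advice can be sketched as follows. The Appender interface and the in-memory implementation are hypothetical stand-ins written for this example; they only mimic the append/commit behaviour described above and are not the Egothor2 tanker API. The batch size is an arbitrary illustration, not a recommended value.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in only; not an Egothor2 class. */
final class BatchCommitSketch {

    /** Hypothetical minimal view of an index writer with append and commit. */
    interface Appender<T> {
        void append(T item);
        void commit();
    }

    /**
     * Appends all items, committing after every batchSize appends and once
     * more at the end for any remainder. Returns the number of commits issued.
     */
    static <T> int indexInBatches(Appender<T> target, List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int commits = 0;
        int pending = 0;
        for (T item : items) {
            target.append(item);
            pending++;
            if (pending == batchSize) {
                target.commit();
                commits++;
                pending = 0;
            }
        }
        if (pending > 0) {
            target.commit();
            commits++;
        }
        return commits;
    }

    /** In-memory stand-in: appended items become visible only on commit. */
    static final class InMemoryAppender implements Appender<String> {
        final List<String> pendingItems = new ArrayList<>();
        final List<String> visibleItems = new ArrayList<>();

        @Override
        public void append(String item) {
            pendingItems.add(item);
        }

        @Override
        public void commit() {
            visibleItems.addAll(pendingItems);
            pendingItems.clear();
        }
    }

    public static void main(String[] args) {
        InMemoryAppender index = new InMemoryAppender();
        int commits = indexInBatches(index, List.of("a", "b", "c", "d", "e"), 2);
        System.out.println(commits + " commits, " + index.visibleItems.size() + " visible items");
    }
}
```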
Finally, open structures must be closed:
    tanker.close();
    repo.close();
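If indexing can throw, it may be worth making sure that both close calls still run, in the same order as above. The following sketch shows one way to do that with nested try/finally blocks. The Resource class is a hypothetical stand-in for the tanker and the repository; this guide does not state whether the Egothor2 classes implement AutoCloseable, so the sketch does not rely on try-with-resources.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in only; not an Egothor2 class. */
final class CloseOrderSketch {

    /** Hypothetical resource that records when it is closed. */
    static final class Resource {
        private final String name;
        private final List<String> log;

        Resource(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        void close() {
            log.add("closed " + name);
        }
    }

    /**
     * Runs the work, then closes the tanker stand-in first and the repo
     * stand-in second, even if the work throws an unchecked exception.
     */
    static void runAndClose(Runnable work, List<String> log) {
        Resource repo = new Resource("repo", log);
        Resource tanker = new Resource("tanker", log);
        try {
            work.run();
        } finally {
            try {
                tanker.close();
            } finally {
                repo.close();
            }
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        runAndClose(() -> { /* append and commit would go here */ }, log);
        System.out.println(log);
    }
}
```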
For more details, see the org.egothor.apps.Directory program.
The repo instance may be shared by many tankers.
Yes, via token tags.
Yes, unless you want to keep all revisions. This can be configured in the respective tankers.
The search may be executed on a tanker you use for indexing, or you can create a new instance.
    repo = new MapDbRepository(location);
    TankerImplSecure tankerOrig = new TankerImplSecure();
    tankerOrig.initializeTankerSecure(location, repo, true, false, 32, 10, null, 10000);
Now, you are ready to issue full-text queries:
    QueryResponse qr = tankerOrig.querySecure(
            0 /* 1st hit offset, 0=top */,
            10 /* number of hits */,
            1 /* model, 1=classic vector model */,
            "your google like query",
            Long.MAX_VALUE /* process full index, all items=MAX_VALUE */,
            0.0 /* do not rerank pagerank factor */);
    Sequence<Hit> e = qr.getResult();
    System.out.println("Hits (scanned): " + qr.getHitsScanned());
    System.out.println("Hits (guess): " + qr.getWouldBe());
    int offset = qr.getOffset();
    Hit hit;
    while ((hit = e.next()) != null) {
        offset++;
        System.out.println(offset + ". " + hit.sim + " " + hit.getMeta());
    }
The method hit.getMeta() returns an instance of your original metadata submitted in the indexing phase.
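Because the query call takes the offset of the first hit and the number of hits, paging through results comes down to simple arithmetic. The sketch below shows that arithmetic on its own. PagingSketch and its methods are hypothetical helpers, and the use of a long for the total hit count is an assumption, since the return type of qr.getWouldBe() is not shown in this guide.

```java
/** Illustrative stand-in only; not an Egothor2 class. */
final class PagingSketch {

    /** Offset of the first hit on a zero-based page (page 0 starts at offset 0). */
    static int offsetForPage(int page, int pageSize) {
        if (page < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("page must be >= 0 and pageSize > 0");
        }
        // multiplyExact throws instead of silently overflowing.
        return Math.multiplyExact(page, pageSize);
    }

    /** Number of pages needed to show totalHits hits, rounding up. */
    static long pageCount(long totalHits, int pageSize) {
        if (totalHits < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("totalHits must be >= 0 and pageSize > 0");
        }
        return totalHits / pageSize + (totalHits % pageSize == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        int pageSize = 10;
        // The third page (zero-based index 2) would pass this offset to the query call.
        System.out.println("offset = " + offsetForPage(2, pageSize));
        System.out.println("pages  = " + pageCount(25, pageSize));
    }
}
```

Since the reported hit count is described as a guess, a page count derived from it is best treated as an estimate.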
The core also records how it understood your query. This interpretation can be displayed via:
    // Requires javax.xml.parsers.*, javax.xml.transform.*,
    // javax.xml.transform.dom.DOMSource, javax.xml.transform.stream.StreamResult
    // and org.w3c.dom.*; Document below is org.w3c.dom.Document.
    :
    printQuery(qr.getAdaptedQuery());
    :
    public static void printQuery(org.egothor.core.query.Query q) {
        if (q != null) {
            try {
                // Build an empty DOM document with a <query> root element.
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                DocumentBuilder parser = factory.newDocumentBuilder();
                DOMImplementation impl = parser.getDOMImplementation();
                Document document = impl.createDocument(null, "query", null);
                // Let the query describe itself as a DOM node.
                Node n = q.explain(document);
                document.getDocumentElement().appendChild(n);
                // Serialize the DOM to System.out with an identity transform.
                TransformerFactory xformFactory = TransformerFactory.newInstance();
                Transformer idTransform = xformFactory.newTransformer();
                Source input = new DOMSource(document);
                Result output = new StreamResult(System.out);
                idTransform.transform(input, output);
            } catch (FactoryConfigurationError x) {
                System.out.println("Could not locate a factory class");
            } catch (ParserConfigurationException x) {
                System.out.println("Could not locate a JAXP parser");
            } catch (TransformerConfigurationException x) {
                System.out.println("This DOM does not support transforms.");
            } catch (TransformerException x) {
                System.out.println("Transform failed.");
            }
        }
    }
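The serialization step in printQuery is plain JAXP and can be tried without Egothor2. The following self-contained sketch builds a small DOM tree and renders it to a string with an identity transform. The term element is a made-up placeholder for whatever q.explain(document) would return; the real structure of the explained query is not described in this guide.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Illustrative stand-in only; not an Egothor2 class. */
final class DomPrintSketch {

    /** Builds a query document with one placeholder child and returns it as XML text. */
    static String render(String term) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.getDOMImplementation().createDocument(null, "query", null);

            // Placeholder for q.explain(document): a hand-made element (hypothetical name).
            Element node = document.createElement("term");
            node.setTextContent(term);
            document.getDocumentElement().appendChild(node);

            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not render the DOM document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render("egothor"));
    }
}
```

Writing to a StringWriter instead of System.out makes the output easy to log or compare in tests.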