We run Confluence in production and we want to run it on more than one server for all the obvious reasons: load balancing and availability. One of the hard nuts to crack in doing that is clustering Lucene.
If you don't know already, Lucene is a "high-performance, full-featured text search engine library". It's used in a lot of software right now including Confluence and JIRA from Atlassian.
Wouldn't it be nice if you could use Lucene normally without having to worry about keeping it consistent all over the place?
Well, Steve and I tried clustering a Lucene index with Terracotta. It worked and I think that's pretty cool.
We used an implementation of the Lucene Directory interface called the RAMDirectory as the index store and made it a clustered object. That's done with a scrap of configuration that tells Terracotta to make our RAMDirectory shared. After that, manipulating the index is business as usual.
Adding to the index looks something like this:
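import java.io.IOException;
import java.util.Random;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;

// Members of the demo class (PhoneIndexer in the config further down).
// The directory field is the object that Terracotta shares across JVMs.
private RAMDirectory directory = new RAMDirectory();
private StandardAnalyzer analyser = new StandardAnalyzer();

private void addToIndex(String name, String number) throws IOException {
    // Create a fresh index only if the directory is still empty.
    IndexWriter writer = new IndexWriter(directory, analyser,
            directory.list().length == 0);
    // Build one document carrying 100 randomised name fields and 100 number fields.
    Document doc = new Document();
    Random r = new Random();
    for (int i = 0; i < 100; i++) {
        doc.add(new Field("name", name + r.nextFloat(), Field.Store.YES,
                Field.Index.TOKENIZED));
        doc.add(new Field("number", number, Field.Store.YES,
                Field.Index.TOKENIZED));
    }
    System.out.println("Optimizing...");
    // Synchronizing on the shared directory is what the autolock config clusters.
    synchronized (directory) {
        long start = System.currentTimeMillis();
        writer.addDocument(doc);
        writer.optimize();
        writer.close();
        System.out.println("Took:" + (System.currentTimeMillis() - start));
    }
}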
Reading from the index looks something like this:
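import java.io.IOException;
import java.util.Iterator;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

private void queryIndex(String name) throws ParseException, IOException {
    QueryParser parser = new QueryParser("name", analyser);
    Query query = parser.parse(name);
    BooleanQuery.setMaxClauseCount(100000);
    IndexSearcher is = new IndexSearcher(directory);
    long start = System.currentTimeMillis();
    Hits hits = is.search(query);
    System.out.println("Took:" + (System.currentTimeMillis() - start));
    System.out.println("Hits:" + hits.length());
    // Load each matching document to prove the stored fields are readable.
    for (Iterator i = hits.iterator(); i.hasNext();) {
        Hit hit = (Hit) i.next();
        Document doc = hit.getDocument();
    }
    // Release the reader that the searcher opened on the directory.
    is.close();
}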
As you can see, dealing with the index is pretty simple. This example code isn't really any different with clustering enabled than it is without clustering. In fact, turning clustering on and off is as simple as invoking java with or without a couple of Terracotta options.
Keeping this same RAMDirectory consistent across multiple JVMs without transparent object clustering would be a real hassle. Just to keep the indexes up to date by hand, you'd have to trap changes to them and then somehow send those changes out to the other JVMs and apply them. Keeping the indexes consistent is even harder.
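To give a feel for that hassle, here is a minimal sketch of the trap-send-apply approach. It is not code from this experiment: the names HandRolledReplication, IndexChange and Node are hypothetical, a plain map stands in for each JVM's private index, and a direct method call stands in for the network hop a real system would need.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class HandRolledReplication {

    /** One trapped change to the index: an add or a delete, keyed by document id. */
    static final class IndexChange {
        final boolean delete;
        final String docId;
        final String text;

        IndexChange(boolean delete, String docId, String text) {
            this.delete = delete;
            this.docId = docId;
            this.text = text;
        }
    }

    /** Stand-in for one JVM holding its own private copy of the index. */
    static final class Node {
        final String name;
        final Map<String, String> index = new ConcurrentHashMap<>();
        final List<Node> peers = new CopyOnWriteArrayList<>();

        Node(String name) {
            this.name = name;
        }

        /** Local add: apply the change here, then push it to every peer. */
        void add(String docId, String text) {
            IndexChange change = new IndexChange(false, docId, text);
            apply(change);
            broadcast(change);
        }

        /** Local delete: apply the change here, then push it to every peer. */
        void delete(String docId) {
            IndexChange change = new IndexChange(true, docId, null);
            apply(change);
            broadcast(change);
        }

        /** Applies a change to this node's copy only (no re-broadcast). */
        void apply(IndexChange change) {
            if (change.delete) {
                index.remove(change.docId);
            } else {
                index.put(change.docId, change.text);
            }
        }

        /** Assumption: a real system would serialise the change and send it over the wire. */
        private void broadcast(IndexChange change) {
            for (Node peer : peers) {
                peer.apply(change);
            }
        }
    }

    public static void main(String[] args) {
        Node a = new Node("a");
        Node b = new Node("b");
        a.peers.add(b);
        b.peers.add(a);

        a.add("1", "Orion 555-1234");
        b.add("2", "Steve 555-9876");
        a.delete("2");

        System.out.println(a.index); // prints {1=Orion 555-1234}
        System.out.println(b.index); // prints {1=Orion 555-1234}
    }
}
```

Even this toy version needs a change type, a peer list and a broadcast step, and it still says nothing about ordering, lost messages or nodes that join late, which is where the consistency problems start.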
With transparent object clustering, you don't have to worry about how the data moves around and you can keep your code simple and to the point without a bunch of clustering gunk dirtying things up.
Here's the interesting part of the config I mentioned earlier:
<locks>
  <!-- This part declares the locks that should be acquired. The
       "named-lock" stanzas declare that a lock with a given name will be
       acquired before the method(s) matching the given method regular
       expression are called. -->
  <named-lock>
    <lock-name>lockOne</lock-name>
    <method-expression>* org.apache.lucene.demo.PhoneIndexer.queryIndex(..)</method-expression>
  </named-lock>
  <named-lock>
    <lock-name>lockOne</lock-name>
    <method-expression>* org.apache.lucene.store.RAMDirectory.renameFile(..)</method-expression>
  </named-lock>
  <named-lock>
    <lock-name>lockOne</lock-name>
    <method-expression>* org.apache.lucene.store.RAMDirectory.createOutput(..)</method-expression>
  </named-lock>
  <named-lock>
    <lock-name>lockOne</lock-name>
    <method-expression>* org.apache.lucene.store.RAMDirectory.deleteFile(..)</method-expression>
  </named-lock>
  <named-lock>
    <lock-name>lockOne</lock-name>
    <method-expression>* org.apache.lucene.store.RAMOutputStream.close(..)</method-expression>
  </named-lock>
  <!-- The "autolock" stanzas declare that any concurrency primitives
       (synchronized, wait, notify) found in methods matching the given
       method regular expression (in this case, every method) will be
       clustered if synchronized, wait, or notify are called on a clustered
       object in that method. -->
  <autolock>
    <method-expression>* *..*.*(..)</method-expression>
  </autolock>
</locks>
<!-- This part declares the RAMDirectory object as the root of a clustered object graph.
     Everything referenceable by this root object also becomes a shared object. -->
<roots>
  <root>
    <field-name>org.apache.lucene.demo.PhoneIndexer.directory</field-name>
  </root>
</roots>
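All five named-lock stanzas use the same name, lockOne, so the matching methods exclude one another. As a rough single-JVM analogy only, here is what "acquire a lock with a given name before the method runs" looks like in plain Java. This is not how Terracotta implements it (Terracotta applies the lock across the cluster without any code changes), and NamedLockSketch and withNamedLock are hypothetical names made up for the illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

public class NamedLockSketch {

    // One lock per name; every caller that uses the same name shares the same lock.
    private static final Map<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    /** Runs body while holding the lock registered under lockName. */
    static <T> T withNamedLock(String lockName, Supplier<T> body) {
        ReentrantLock lock = LOCKS.computeIfAbsent(lockName, n -> new ReentrantLock());
        lock.lock();
        try {
            return body.get();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final int[] counter = {0};

        // Two threads increment a plain int; the shared named lock keeps the updates from racing.
        Runnable work = () -> {
            for (int i = 0; i < 10_000; i++) {
                NamedLockSketch.<Integer>withNamedLock("lockOne", () -> ++counter[0]);
            }
        };

        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        System.out.println(counter[0]); // prints 20000
    }
}
```

The difference with the config above is that nothing in the Lucene code has to call a helper like this; the method expressions tell Terracotta where to take the lock.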
The code we used was expedient for our purposes, but not particularly easy to use. I'll post polished example code and instructions on how to run the test yourself next time.
7 comments:
Great post! Fascinating. I like the idea of a clustered RAMDirectory, I'm going to have to explore this myself and see if we can't offer it as an option.
Clustering Confluence itself though is a little more complex (just to warn you), for more detail see my pragmatic clustering post.
Cheers,
Mike
I'm sure there's a lot more to clustering Confluence than clustering the indexes, but it seemed like an interesting problem not easily solved.
As the main post said, I'll be putting up a full-fledged example as soon as I get it polished up in a way that's easy to try.
Cheers,
Orion
What happens when the index starts getting large?
With the application we are working on, the index may become quite large (in the gigabytes). We have done clustering the way you first mentioned: maintaining the index on each node, and using JGroups to distribute messages for indexing and deletion.
Very interesting. It looks like this will update every RAMDirectory on each node. Then a "timer" or some other criterion can be set to flush the RAMDirectory to an FSDirectory (that will handle the gigabyte problem a previous poster mentioned)... The searcher part needs some work :P. Technically it needs to search the first "available" FSDirectory in the cluster, maybe?
I look forward to the full example using the RAMDirectory/FSDirectory to store the index.
Lata,
Jeryl
Regarding the "how big can it get?" question, I've posted an answer here: Clustering Lucene, Part II: An Example You Can Try Yourself
Regarding Pharaoh's comment about flushing to an on-disk store: that's an interesting idea. I'll talk to the team about it. But, you might not need to do all that work, since the clustered RAMDirectory is made persistent by DSO. And, the updates are only sent where they are needed—only the JVMs that have references to the changed parts of the index will see the changes. Not all JVMs need to have the entire index in RAM at once, they won't need to apply changes to the parts they don't have in RAM, and they will automatically fault in the pieces they don't have as they need them.
Q: Is this safe for concurrent use?
Perhaps RAMDirectory is safe for concurrent updates/reads but I would guess not (need to check).
Specifically, I didn't think Lucene indexes were generally safe for reading while an update was occurring. If this is the case for RAMDirectory, then how are you handling that issue?
In the past I have generally used two indexes to enable this: one is written to, and then readers migrate over to the 'new version' of the index.
Perhaps I am missing something, or you lock out readers during your updates?
Cheers, Rob.
There's now a Terracotta configuration bundle for Lucene. Check out the Lucene integrations page: Lucene Integration