So apart from needing to redesign/rebuild my site, I really need to start writing more about my experiments. More for my own record than a public one.
Recently started playing with the whole Nutch family (Hadoop, MapReduce, HBase, Pig, Solr, Nutch, etc.). I finally got Nutch 2.1 set up with Cassandra 1.2 (that in itself should be another article) with the aim of running data extraction and posting to Solr's Lucene index. Initially I've just indexed my own site, but I need to inspect the Cassandra data. I've been playing with pycassa with some success (the library rocks, I just suck at Python) and looking at some GUIs, but I'm now settling on CQL as a means to navigate the data. In this post I hope to record the queries I'm using to inspect a Cassandra Nutch data store.
First thing to note: tab completion! Yeah baby. No actually, first start cqlsh from inside your Cassandra distribution. If you did what I did (ran bin/cqlsh and typed help), you'll see CQL spec 3.0.0, while the docs I'm reading refer to 2.0 (http://cassandra.apache.org/doc/cql/CQL.html). So exit; out of there and reopen with bin/cqlsh -2 to get into the correct spec.
On the first line type 'use web' and hit tab to complete it to webpage, unless you changed your keyspace name in gora-cassandra-mapping.xml. From here on in we're assuming you haven't changed anything.
This makes all queries default to this keyspace, just like use $schema in MySQL.
Some simple queries to kick off. Get all the baseUrls (or other fields) out of the system:
SELECT bas FROM f; // get all base urls (seeds)
SELECT s FROM f; // get score of all pages
SELECT cnt FROM f; // get all content
SELECT c FROM p; // or get only text, no html content
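Since I keep forgetting which short column name lives in which column family, here is a rough Python sketch that builds the same queries from a lookup table. The mapping only contains the four fields used in this post, and both the NUTCH_COLUMNS table and the select_for helper are names I made up for illustration. It only builds the query strings; it doesn't talk to Cassandra, so paste the output into cqlsh -2.

```python
# Field -> (column family, column name), taken from the queries in this post.
# Assumption: the default gora-cassandra-mapping.xml is in use; other fields
# would need to be looked up there and added by hand.
NUTCH_COLUMNS = {
    "baseUrl": ("f", "bas"),
    "score": ("f", "s"),
    "content": ("f", "cnt"),
    "text": ("p", "c"),
}


def select_for(field, limit=None):
    """Build a CQL SELECT string for one Nutch field (hypothetical helper)."""
    family, column = NUTCH_COLUMNS[field]
    query = "SELECT {0} FROM {1}".format(column, family)
    if limit is not None:
        # LIMIT keeps cqlsh from dumping every row for big columns like cnt.
        query += " LIMIT {0}".format(int(limit))
    return query + ";"


if __name__ == "__main__":
    for name in sorted(NUTCH_COLUMNS):
        print(select_for(name, limit=10))
```

The LIMIT is optional, but it's handy for the content column, which otherwise prints whole pages of HTML into the terminal.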
And that's about all I can think of for tonight. Apparently, having got this far with CQL, I need to put some additional indexes on Cassandra to be able to introduce the groovy WHERE clauses.