Hello!

Thanks to http://www.wordle.net/, I’ve managed to build a tag cloud of the queries in the (in)famous AOL query log. The resulting picture is pretty nice… Hope you’ll like it!
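A tag cloud is essentially a picture of term frequencies, so the step before handing the data to a tool like Wordle is counting how often each term occurs in the queries. The sketch below is one way to do that, not necessarily how the picture in this post was produced. It assumes the queries have already been extracted from the log, one per string, and the function name countTerms is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Counts how often each whitespace-separated term occurs across all queries.
// Terms are lowercased so that "Music" and "music" count as the same term.
// Assumption: `queries` holds one query string per element.
std::map<std::string, std::size_t> countTerms(const std::vector<std::string>& queries) {
    std::map<std::string, std::size_t> counts;
    for (const std::string& query : queries) {
        std::istringstream in(query);
        std::string term;
        while (in >> term) {
            for (char& c : term) {
                c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            }
            ++counts[term];
        }
    }
    return counts;
}
```

The resulting map can then be written out in whatever form the drawing tool expects; larger counts become larger words.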

 

Comments 2 Comments »

#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

// Reads a TREC-formatted file one <DOC>...</DOC> block at a time.
class trecReader {
public:
    trecReader(const char* fN) : inStream(fN) {
        // Test the stream state, not a pointer, to detect a failed open.
        if (!inStream) {
            cerr << "Error opening " << fN << ": " << strerror(errno) << endl;
            exit(-1);
        }
    }

    // Fills `lines` with the body of the next document, skipping the
    // <DOCNO>, <DOCOLDNO> and <DOCHDR> parts. Returns false when the
    // file ends before a complete document has been read.
    bool getNextDoc(vector<string>& lines) {
        string sLine;
        lines.clear();

        // The file is read line by line; getline strips the trailing \n.
        // Skip ahead to the next <DOC> tag, stopping at end of file.
        do {
            if (!getline(inStream, sLine)) return false;
        } while (sLine.substr(0, 5) != "<DOC>");

        if (!getline(inStream, sLine)) return false;
        if (sLine.substr(0, 7) == "<DOCNO>") {
            if (!getline(inStream, sLine)) return false;
        }
        if (sLine.substr(0, 10) == "<DOCOLDNO>") {
            if (!getline(inStream, sLine)) return false;
        }
        if (sLine.substr(0, 8) == "<DOCHDR>") {
            // Skip everything up to and including </DOCHDR>.
            do {
                if (!getline(inStream, sLine)) return false;
            } while (sLine.substr(0, 9) != "</DOCHDR>");
            if (!getline(inStream, sLine)) return false;
        }

        // Collect the document body until the closing </DOC> tag.
        while (sLine.substr(0, 6) != "</DOC>") {
            lines.push_back(sLine);
            if (!getline(inStream, sLine)) return false;
        }
        return true;
    }

private:
    ifstream inStream;
};

Comments No Comments »

Dear all,

If you try to compile the example at http://goog-sparsehash.sourceforge.net/doc/sparse_hash_map.html on an Ubuntu Linux box, you will get an error saying that the hash template cannot be found… Just add using __gnu_cxx::hash; next to the other using lines at the top of the file.

This is the new example code:

#include <cstring>     // strcmp
#include <iostream>
#include <google/sparse_hash_map>

using namespace std;
using namespace google;   // namespace where class lives by default
using __gnu_cxx::hash;    // the added line: makes the hash template visible

struct eqstr {
    bool operator()(const char* s1, const char* s2) const {
        return strcmp(s1, s2) == 0;
    }
};

int main() {
    sparse_hash_map<const char*, int, hash<const char*>, eqstr> months;

    months["january"] = 31;
    months["february"] = 28;
    months["march"] = 31;
    months["april"] = 30;
    months["may"] = 31;
    months["june"] = 30;
    months["july"] = 31;
    months["august"] = 31;
    months["september"] = 30;
    months["october"] = 31;
    months["november"] = 30;
    months["december"] = 31;

    // A deleted key must be set before erase() can be called.
    months.set_deleted_key("");
    months.erase("may");
    months["may"] = 31;

    cout << "september -> " << months["september"] << endl;
    cout << "april     -> " << months["april"] << endl;
    cout << "june      -> " << months["june"] << endl;
    cout << "november  -> " << months["november"] << endl;
    cout << "may       -> " << months["may"] << endl;
}

Comments No Comments »

Dear All,

Many people today study query logs to get a picture of what users actually look for on real-world search engines.
I’m posting here the Excite query logs we used in our experiments so that other people can use them in their research. All of these query logs have been made publicly available by their respective owners.

Excite logs

1997 Small version

1997 Integral version

1999 Integral version

Comments 1 Comment »

I just learned that a paper of ours (written with colleagues at the Yahoo! Research lab in Barcelona) has been accepted for a special issue of ACM Transactions on the Web (TWEB)… I’m really thrilled about it.

Here are some details about the paper. Its title is “Design trade-offs for search engine caching”, and my co-authors are:

  • Ricardo Baeza-Yates
  • Aristides Gionis
  • Flavio Junqueira
  • Vanessa Murdock
  • Vassilis Plachouras

Abstract:

In this paper we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log affect the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
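To make the static vs. dynamic distinction from the abstract concrete, here is a minimal sketch. It is not the algorithm proposed in the paper and it only covers caching of query results, not posting lists. It assumes a toy query stream held in memory; the names staticHits and lruHits are hypothetical, and LRU is just one common choice of dynamic policy picked for illustration. The static cache is filled once with the most frequent queries of a training log and never changes, while the dynamic cache evicts the least recently used query when it is full.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::size_t> QueryCount;

// Most frequent first; ties are broken alphabetically so the result is deterministic.
static bool moreFrequent(const QueryCount& a, const QueryCount& b) {
    return a.second != b.second ? a.second > b.second : a.first < b.first;
}

// Static cache: holds the `capacity` most frequent queries of `train` and is
// never updated afterwards. Returns the number of cache hits on `test`.
std::size_t staticHits(const std::vector<std::string>& train,
                       const std::vector<std::string>& test,
                       std::size_t capacity) {
    assert(capacity > 0);
    std::unordered_map<std::string, std::size_t> freq;
    for (const std::string& q : train) ++freq[q];

    std::vector<QueryCount> byFreq(freq.begin(), freq.end());
    std::sort(byFreq.begin(), byFreq.end(), moreFrequent);

    std::unordered_set<std::string> cache;
    for (std::size_t i = 0; i < byFreq.size() && i < capacity; ++i) {
        cache.insert(byFreq[i].first);
    }

    std::size_t hits = 0;
    for (const std::string& q : test) {
        if (cache.count(q) > 0) ++hits;
    }
    return hits;
}

// Dynamic cache with LRU eviction. Returns the number of cache hits on `stream`.
std::size_t lruHits(const std::vector<std::string>& stream, std::size_t capacity) {
    assert(capacity > 0);
    std::list<std::string> order;  // front = most recently used
    std::unordered_map<std::string, std::list<std::string>::iterator> pos;
    std::size_t hits = 0;
    for (const std::string& q : stream) {
        auto it = pos.find(q);
        if (it != pos.end()) {
            ++hits;
            // Move the entry to the front; splice keeps the iterator valid.
            order.splice(order.begin(), order, it->second);
        } else {
            if (order.size() == capacity) {
                pos.erase(order.back());  // evict the least recently used query
                order.pop_back();
            }
            order.push_front(q);
            pos[q] = order.begin();
        }
    }
    return hits;
}
```

On a real log one would fill the static cache from an earlier period and count hits for both caches on a later period; dividing the hit counts by the number of queries in that period gives the hit rates to compare.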

 

Comments No Comments »

After a crash of the MySQL database on my previous machine, I’m happy to announce that my website is up and running again!

Comments No Comments »