Simple Recommendation Engine

What do you do if you have a collection of documents and you want to match them? One option is to develop a service as powerful and complex as Zemanta. But what technique would you use to do it in the simplest possible way, without relying on any external service or software? I was confronted with this challenge recently and, of course, I couldn’t resist it. The result is a simple recommendation engine that I’m presenting in this post.

The approach I’ve taken in implementing the simple recommendation engine is to first autotag all the posts and then match them by identifying the posts that have the most autotags in common with the query post. In effect, I’ve reduced the problem of matching to the problem of auto-tagging.
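As a minimal sketch of this matching step (function and variable names are mine, not from the actual repository), ranking posts by the number of shared autotags is just a set intersection:

```python
def match_posts(query_tags, corpus):
    """Rank corpus posts by the number of autotags shared with the query.

    query_tags -- set of autotags of the query post
    corpus     -- dict mapping a post id to its set of autotags
    """
    scored = [
        (len(query_tags & tags), post_id)
        for post_id, tags in corpus.items()
    ]
    scored.sort(reverse=True)  # highest overlap first
    # drop posts that have nothing in common with the query
    return [post_id for overlap, post_id in scored if overlap > 0]

corpus = {
    "a": {"django", "python", "web"},
    "b": {"football", "uefa"},
    "c": {"python", "testing"},
}
print(match_posts({"django", "python"}, corpus))  # ['a', 'c']
```

The real engine adds tag weighting and tunable parameters on top of this, but the core idea is exactly this overlap count.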

Zemanta already provides very good auto-tagging functionality, but only as a service. The technology behind this service is far too complex and resource intensive to be used as standalone software. Instead of using the full Zemanta service, I’ve used a collection of three million documents auto-tagged by Zemanta to learn a set of English words that are especially suitable/popular for tagging. I’ve come up with a list of 18K such words that is available here. I considered a word suitable for tagging if it was used by the Zemanta auto-tagger at least 25 times. The resulting list of suitable words is ordered by the ratio between the number of times the word was used as a tag by the Zemanta auto-tagger and the total number of documents the word occurs in. For example, the word django occurs in 1863 out of 3 million documents and was used 858 times as a tag for these documents. The resulting score of the word django is 858/1863 = 0.46, which makes it the 222nd most suitable word for tagging.
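The scoring rule above can be written down in a few lines. This is a sketch under my naming assumptions (the function name and the `min_uses` parameter are illustrative), reproducing the django example from the text:

```python
def tag_suitability(tag_uses, doc_occurrences, min_uses=25):
    """Score how suitable a word is for tagging.

    tag_uses        -- times the auto-tagger used the word as a tag
    doc_occurrences -- number of documents the word appears in
    Returns None when the word was tagged too rarely to be trusted.
    """
    if tag_uses < min_uses:
        return None
    return tag_uses / doc_occurrences

# the django example: tagged 858 times across 1863 occurrences
score = tag_suitability(858, 1863)
print(round(score, 2))  # 0.46
```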

I haven’t done an extensive performance evaluation of the engine, but at first glance its results seem very good. I’ve included three evaluation datasets that can be used to fine-tune the parameters of the engine. Please try the engine yourself by cloning/downloading the code from GitHub and issuing the following command

python avc_blog.json unigrams.csv

Let me know in the comments what you think of it.

Second Screen and Digital TV

I love a good football game and the Champions League final has been my favorite match of the year ever since the legendary 1999 final between Manchester United and Bayern Munich. This year’s final promised another good game and I went to watch it with some friends at a bar next to the Ljubljanica river. There are several bars close by along this stretch of the river, each with an outside garden and a big TV for football-loving patrons. Sipping beer and watching the game was a very nice experience, but only until we noticed that the people at the neighboring bar got to see the action five seconds before we did. They were watching the game on a different television channel with less time delay. After half an hour of hearing about the game first from the shouting people in the bar next door and only seeing it five seconds later on the television screen in our bar, we decided to move next door. Watching a football game with a delay just isn’t a fun thing to do.

Modern digital TV has a delay of approximately two minutes from the real action, mostly due to the time needed to perform various forms of coding and recoding. In the past this wasn’t such a big problem, since television had a monopoly on transferring information from the field to the living room. But the recent proliferation of the second-screen phenomenon is changing this paradigm entirely, since Twitter and Facebook broke the monopoly of big media on real-time information transfer. The following tweet by Werner Vogels explains the situation best

It will be very interesting to see the effect of this discrepancy. I think this is a great opportunity for some startup to solve.

Cassandra Counters

Yesterday’s frenzy of tracking the results of the US presidential elections was yet another demonstration of the importance of timely and accurate counting. Knowing how many events have happened is the cornerstone of statistics, and statistics is the foundation of all modern science and engineering. Counting has made Google successful, and counting is gaining even more prominence recently with the increased use of social signals in the algorithms powering many web services.

It is very difficult to implement counters on top of traditional storage systems in a scalable way, since traditional storage systems operate in a lock-read-update-write-release sequence, which prevents high-frequency updates of counters from being performed in parallel. Many specialized solutions have been devised over the years that enable fast parallel counting, but only with the recent development of Cassandra counters did we get a general-purpose tool for highly available counters that can be incremented at high frequency.

At Zemanta we have started to use Cassandra counters only recently, so we don’t have extensive experience with them yet. But the first impression is extremely positive. We are able to process ~100 increments per second effortlessly on a very modest hardware setup. Additionally, counting is very fast, with an average increment time of ~5 ms and maximums rarely exceeding 30 ms.
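As an illustration (the table and column names are hypothetical, not our actual schema), the appeal of Cassandra counters is that an increment is a single commutative write, with no read or lock beforehand:

```sql
-- a CQL counter table: every non-key column must be of type counter
CREATE TABLE impressions (
    post_id text PRIMARY KEY,
    views   counter
);

-- increments are commutative, so replicas can apply them in any
-- order; no lock-read-update-write-release cycle is needed
UPDATE impressions SET views = views + 1 WHERE post_id = 'abc123';
```

This is exactly what lets many clients hammer the same counter in parallel without serializing on a lock.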


In April, I wrote about how at Zemanta we constantly move between Amazon Web Services (AWS) and our dedicated servers in order to optimize the costs of our operations. These two options are quite extreme, so recently we started to use a third option, provided by the German server infrastructure provider Hetzner, which in terms of elasticity sits somewhere between AWS and owning your own servers, but is much more cost effective than both. Hetzner provides quick provisioning of dedicated servers for a very competitive price and without requiring a long-term commitment.

AWS is the world’s leading cloud provider and it is seven times bigger than its nearest competitor. The economies of scale that AWS enjoys are enormous, but so is the complexity of providing such a service, which is beyond the grasp of a layman. Hetzner took a very different approach: instead of virtual servers it provides access to bare metal, at scale and with the ease of AWS. For most workloads the elasticity provided by AWS is overkill that isn’t worth paying for. Cloud computing is exiting the peak of inflated expectations and it seems the pendulum is now swinging towards less complex solutions.

While it’s always nice to see engineers dressed up, it’s the non-rack servers that are stealing the show in this photo (source: Hetzner Online AG)

Transitioning from Oracle through PostgreSQL to MySQL

I was first introduced to the world of relational databases some 15 years ago, while developing geographic information systems (GIS) at Monolit. At that time GIS were just making the transition from storing attribute information in flat files to storing it in relational databases, while relational databases had just started their expansion from the world of accounting and financial applications to other application domains. Oracle was especially aggressive in the field of GIS, so we ended up using it. The world of relational databases has since become commoditized, but Oracle somehow managed to keep its price very high. Therefore, later in my career, the free and open-source PostgreSQL increasingly became a viable option in my projects. Except for missing a few very useful features (e.g. partitions), PostgreSQL didn’t feel much different from Oracle, so my concept of a relational database didn’t change a lot.

My perception of relational databases was eventually shattered when I joined Zemanta and started using MySQL. Deceived by the SQL in its name, I thought MySQL was yet another open-source implementation of a trimmed-down Oracle, just like PostgreSQL is. Quite to my surprise, I found out that except for the common SQL interface, MySQL has very little in common with Oracle and other traditional databases. MySQL is really only about performance, while reliability and consistency, the essence of traditional relational databases, aren’t such a big concern. Consequently, it took me quite a while to acknowledge MySQL for what it is and adjust my perception of relational databases accordingly.

Syncing Offices across the Pond

One of the main conclusions of the Zemanta gathering that took place last week was that the Skype calls we do every Monday in an attempt to sync both offices just aren’t working for us. The quality of video and audio provided by Skype, GoToMeeting, and a few other teleconferencing solutions that we have tried is simply inadequate for efficient communication, and maybe even the format of the calls is inadequate. Therefore, I’m reaching out to you for advice on how we should sync our New York and Ljubljana offices so that people would feel at least partially as close as we felt last week with everybody present in the same place.

Zemanta team doing Stand-Up Paddling (photo by @idioterna)


I met Grant Ingersoll at SIGIR in Portland, but unfortunately I had only a short chat with him. In particular, I wanted to ask him about ElasticSearch, which has gained quite some prominence lately and is quickly becoming serious competition to Solr. Therefore, I was glad that our friends at Sematext started a blog post series comparing these two decent open-source search engines built on top of Lucene.

Solr vs. ElasticSearch: Part 1 – Overview

A good Solr vs. ElasticSearch coverage is long overdue. We make good use of our own Search Analytics and pay attention to what people search for. Not surprisingly, lots of people are wondering when to choose Solr and when ElasticSearch. As the Apache Lucene 4.0 release approaches and with it Solr 4.



SIGIR’12: State of Semantic Search

The last day of SIGIR brought a workshop on entity-oriented and semantic search that was attended by some of the leading researchers in the field. The workshop started with John Shafer clearly demonstrating the limits of current web search.

Search results for “best car gps around $300”. Pay attention to the dates of the results.

In this example Google gets completely confused by the phrase “around $300”, since it doesn’t understand that a car GPS is a device with a property, price, whose range can be specified by saying “around $300”.

While such deep understanding is still quite far away from being used on a large scale by Google, Bing, and others, they are more successful at mining semantics about entities by analyzing data from the web, which seems to be, at least in principle, a solved problem.

Once we have identified entities and their properties, we can search for information on a very different level, as demonstrated by Hannah Bast. The Broccoli search that she has presented works only on English Wikipedia for now, but it has quite some potential to be used also on the wider web.

The feeling I got from the workshop is that semantic search is slowly getting out of the trough of disillusionment and entering the slope of enlightenment. Maybe in a few years users will finally get back results for what they mean, not for the words they’ve used in their query.

SIGIR’12: Explicit Personalization

I increasingly hear complaints, especially from seasoned web users, that they don’t understand anymore how web search engines deliver results to them and that they are starting to feel as frustrated as when Microsoft introduced the infamous Clippy. I think this is happening because extensive personalization and query reformulation are breaking the mental model of search engines that users have in their minds.

At today’s industrial track at SIGIR, several distinguished speakers from the likes of Google and Microsoft made a compelling case that substantial improvements in the relevancy of search results are possible only through more personalization. But search engines don’t know how to efficiently enable people to explicitly personalize their search results, so they resort to implicit personalization by exploiting their knowledge about the user.

I think that web search engines have become so successful because they have managed to translate the chaos of the web into a simple data structure, i.e., a collection of documents, that can be easily sorted through by entering a few keywords. By introducing implicit personalization, web search engines might be improving relevancy, but they are on the other hand destroying the mental model of the web as a collection of documents that made search engines successful in the first place.

SIGIR’12: Exploratory Search

Google earns most of its $20+ billion of revenue from a small percentage of queries which are commercial in nature and where it can leverage its position to influence the user’s decision about purchasing a product or selecting a service. The best chance of dislodging Google from its pedestal as the gatekeeper to information on the web therefore belongs to a service that provides users with better ways to conduct commercial queries. Anybody who has used Google recently for product research before a purchase can testify that Google has become quite useless for such a task.

Many commercial queries are exploratory in nature, since users do not want only more of what they already have; they also want to try new and unknown things. Existing web search engines are particularly bad at supporting such exploratory search, where the user is unfamiliar with the domain, unsure about the ways to achieve his goals, or even unsure about his goals in the first place. The second day of SIGIR brought us a great talk by Brent Hecht on how to map information on the web to cognitive models that already exist in the user’s mind, thus greatly improving the user’s ability to comprehend information. I think the approach presented in this talk also validates the claim in my yesterday’s post that the greatest opportunities in information retrieval lie in novel user interfaces, not in algorithms hidden in the background.