About Dušan Omerčević

Dušan Omerčević is a lead engineer at Zemanta - Your Blogging Assistant

Piecemeal Feature Development and Merging

Branching and merging are staples of team software development, since it’s very important that programmers working on the same code base don’t mess with each other’s changes. But while branching is fun (and hassle-free in git), merging is quite often a big pain. The longer a programmer stays in his own branch, the more his code diverges from master. A good rule of thumb is to sync branches with master every day so that they do not diverge too much. If the whole feature can be developed in a single day, then merging it back into master on the same day is not a problem. But if the feature is big and is developed over several days, it cannot be merged into master piece by piece without changing development practices.

The most important change in development practices is to break the feature down into several smaller pieces, each of which can be merged into master (and preferably deployed to production) independently. If any of the pieces interferes with existing functionality, the existing code should be refactored and made more extensible. I’ve heard several programmers object to this approach, arguing that it makes development longer. In my experience, the initial development of a complex feature does take longer with a piecemeal approach, but once the feature is finished it’s really done. On the other hand, I’ve heard too many programmers utter the famous phrase “I’ve finished the feature. I only need to merge it into master”, followed by several days of frustration while the “finished” feature is shoved into master.

Measuring User Retention

Products thrive or die depending on their ability to attract and retain users. On the web, measuring the number of new users is usually not that hard, while measuring retention is much more difficult. Most services have a distinct event (registration, installation, activation) that can be used as a basis for counting new users, whereas users usually abandon a service slowly and gradually. User churn therefore cannot be tied to a distinct event; in most cases it can only be measured indirectly, by observing no activity from the user within a given period.
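To make the indirect measurement concrete, here is a minimal sketch of detecting churn by inactivity. The function name, the 30-day window and the sample users are all assumptions chosen for illustration, not something we use in production.

```python
from datetime import date, timedelta


def churned_users(last_seen, today, inactivity_days=30):
    """Return users with no observed activity in the last `inactivity_days` days.

    `last_seen` maps a user id to the date of that user's latest activity.
    The 30-day default is an arbitrary, illustrative threshold.
    """
    cutoff = today - timedelta(days=inactivity_days)
    return {user for user, seen in last_seen.items() if seen < cutoff}


# Made-up sample data.
last_seen = {
    "alice": date(2012, 11, 10),
    "bob": date(2012, 9, 1),
    "carol": date(2012, 10, 20),
}
print(sorted(churned_users(last_seen, today=date(2012, 11, 12))))
```

The right length of the inactivity window depends on the service: a tool people use daily can call a user churned much sooner than one they use once a month.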

With the problem of measuring user churn solved, you can start measuring user retention. But once you do, you’ll immediately run into the problem that losing a long-time user is a much bigger loss than losing a user who was only doing a quick evaluation of your service. The right way to solve this problem is cohort analysis. For the purpose of measuring user retention, we consider a cohort to be a group of people who became users of our service within a defined period (e.g. on October 29th, 2012). With the cohort defined, the task of cohort analysis is to track the number of people within the cohort who are still using the service in subsequent periods.
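The idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: cohorts are defined by signup day, a user counts as retained after X days if any activity was observed X or more days after joining, and all names and sample data are made up.

```python
from collections import defaultdict
from datetime import date, timedelta


def retention_by_cohort(signups, activity, days):
    """Return {cohort date: fraction of the cohort still active after `days` days}.

    `signups` maps user id -> signup date (the cohort key).
    `activity` maps user id -> set of dates on which the user was active.
    """
    cohorts = defaultdict(list)
    for user, joined in signups.items():
        cohorts[joined].append(user)

    rates = {}
    for joined, users in cohorts.items():
        threshold = joined + timedelta(days=days)
        retained = sum(
            1 for user in users
            if any(day >= threshold for day in activity.get(user, ()))
        )
        rates[joined] = retained / len(users)
    return rates


# Made-up sample data: two daily cohorts.
signups = {
    "a": date(2012, 10, 29), "b": date(2012, 10, 29),
    "c": date(2012, 10, 29), "d": date(2012, 10, 29),
    "e": date(2012, 11, 5), "f": date(2012, 11, 5),
}
activity = {
    "a": {date(2012, 11, 6)}, "b": {date(2012, 11, 5)},
    "c": {date(2012, 10, 30)},
    "e": {date(2012, 11, 12)}, "f": {date(2012, 11, 13)},
}
rates = retention_by_cohort(signups, activity, days=7)
for cohort in sorted(rates):
    print(cohort, rates[cohort])
```

Computing the rate for several values of `days` gives one line per value when the results are plotted against the cohort date.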

Understanding cohort analysis is not easy, so let me present an example cohort analysis chart from one of our services.

In the chart above, the different lines represent the user retention rate after X days. For example, the green line represents the retention rate after a week. It shows that 79% of the users who joined our service on November 5th were still using it yesterday, while the positive slope of the green line since October 29th indicates that the improvements we have made to our service in the past two weeks had a favorable effect on our user retention rate.

 

Simple Recommendation Engine

What do you do if you have a collection of documents and you want to match them? One option is to develop a service as powerful and complex as Zemanta. But what technique would you use if you wanted to do it in the simplest possible way, without relying on any external service or software? I was confronted with this challenge recently and, of course, I couldn’t resist it. The result is a simple recommendation engine that I’m presenting in this post.

The approach I’ve taken in implementing the simple recommendation engine is to first auto-tag all the posts and then do the matching by identifying the posts that have the most auto-tags in common with the query post. In effect, I’ve reduced the problem of matching to the problem of auto-tagging.
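The matching step can be sketched roughly as follows. This is not the actual code from the engine, just a minimal illustration that assumes each post has already been reduced to a set of auto-tags; the function name, post ids and tags are made up.

```python
def recommend(query_id, tags_by_post, top_n=3):
    """Return up to `top_n` post ids sharing the most tags with the query post.

    `tags_by_post` maps post id -> set of auto-tags for that post.
    Posts with no tag in common with the query are never recommended.
    """
    query_tags = tags_by_post[query_id]
    scored = []
    for post_id, tags in tags_by_post.items():
        if post_id == query_id:
            continue
        overlap = len(query_tags & tags)
        if overlap:
            scored.append((overlap, post_id))
    # Highest overlap first; ties are broken by post id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [post_id for _, post_id in scored[:top_n]]


# Made-up sample data.
tags_by_post = {
    "post-a": {"python", "django", "web"},
    "post-b": {"django", "web", "deployment"},
    "post-c": {"python", "testing"},
    "post-d": {"football", "tv"},
}
print(recommend("post-a", tags_by_post))
```

Counting raw overlap is the simplest possible similarity measure; weighting tags by how distinctive they are would be an obvious refinement.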

Zemanta already provides very good auto-tagging functionality, but only as a service. The technology behind this service is far too complex and resource-intensive to be used as standalone software. Instead of using the full Zemanta service, I’ve used a collection of three million documents auto-tagged by Zemanta to learn a set of English words that are especially suitable/popular for tagging. I’ve come up with a list of 18K such words that is available here. I considered a word suitable for tagging if it was used by the Zemanta auto-tagger at least 25 times. The resulting list of suitable words is ordered by the ratio between the number of times the word was used as a tag by the Zemanta auto-tagger and the total number of documents in which the word occurs. For example, the word django occurs in 1863 out of 3 million documents and was used 858 times as a tag for these documents. The resulting score of the word django is 858/1863 = 0.46, which makes it the 222nd most suitable word for tagging.
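The scoring described above boils down to a filter and a ratio. Here is a minimal sketch of it; the function and parameter names are my own for this illustration, and apart from the django numbers the sample counts are made up.

```python
def tag_suitability(tag_counts, doc_counts, min_tag_uses=25):
    """Rank words by (times used as a tag) / (documents containing the word).

    `tag_counts` maps word -> number of times the auto-tagger used it as a tag.
    `doc_counts` maps word -> number of documents in which the word occurs.
    Words used as a tag fewer than `min_tag_uses` times are dropped.
    """
    scores = {
        word: uses / doc_counts[word]
        for word, uses in tag_counts.items()
        if uses >= min_tag_uses and doc_counts.get(word, 0) > 0
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# The django counts come from the text; "rareword" is a made-up example
# of a word that falls below the 25-use threshold.
tag_counts = {"django": 858, "rareword": 10}
doc_counts = {"django": 1863, "rareword": 12}
for word, score in tag_suitability(tag_counts, doc_counts):
    print(word, round(score, 2))
```

The minimum-use threshold matters because a word seen in only a handful of documents could otherwise get a very high ratio by chance.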

I haven’t done an extensive performance evaluation of the engine, but at first glance its performance seems very nice. I’ve included three evaluation datasets that can be used to fine-tune the parameters of the engine. Please try the engine yourself by cloning/downloading the code from GitHub and issuing the following command:

python sre.py avc_blog.json unigrams.csv

Let me know in the comments what you think of it.

Second Screen and Digital TV

I love a good football game, and the Champions League final has been my favorite match of the year ever since the legendary 1999 final between Manchester United and Bayern Munich. This year’s final promised another good game, and I went to watch it with some friends at a bar next to the Ljubljanica river. There are several bars close together along this stretch of the river, each with an outside garden and a big TV for football-loving patrons. Sipping beer and watching the game was a very nice experience, but only until we noticed that the people at the neighboring bar got to see the action five seconds before we did. They were watching the game on a different television channel with a shorter delay. After half an hour of first hearing about the game from the shouting people in the bar next door and only seeing it five seconds later on the screen in our bar, we decided to move next door. Watching a football game with a delay just isn’t fun.

Modern digital TV lags approximately 2 minutes behind the real action, mostly due to the time needed to perform various forms of encoding and re-encoding. In the past this wasn’t such a big problem, since television had a monopoly on transferring information from the field to the living room. But the recent proliferation of the second screen phenomenon is changing this paradigm entirely, since Twitter and Facebook broke the monopoly of big media on real-time information transfer. The following tweet by Werner Vogels explains the situation best:

It will be very interesting to see the effect of this discrepancy. I think solving it is a great opportunity for some startup.

Go disrupt the Web Search! Mobile changes everything!

I gave a talk at Kiberpipa yesterday about the state of web search. My main point was that web search is no longer a growing industry and might already be on the path to irrelevance. The first victim of this change will be the SEO industry, which will be pushed out by Google as it searches for more revenue. But the second victim might be Google itself, since web search is being fundamentally disrupted by the mobile revolution.

Cassandra Counters

Yesterday’s frenzy of tracking the results of the US presidential election was yet another demonstration of the importance of timely and accurate counting. Knowing how many events have happened is the cornerstone of statistics, and statistics is the foundation of all modern science and engineering. Counting has made Google successful, and counting has recently gained even more prominence with the increased use of social signals in the algorithms powering many web services.

It is very difficult to implement counters in a scalable way using traditional storage systems, since they operate in a lock-read-update-write-release sequence, which prevents high-frequency counter updates from being performed in parallel. Many specialized solutions that enable fast parallel counting have been devised over the years, but only with the recent development of Cassandra counters did we get a general-purpose tool for highly available counters that can be incremented with high frequency.
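To illustrate the contention problem, here is a minimal in-process sketch of one generic workaround, a sharded counter: increments are spread over several independently locked shards and summed on read. This shows the general idea only; it makes no claim about how Cassandra counters are implemented internally, and the class name and shard count are assumptions made for the example.

```python
import random
import threading


class ShardedCounter:
    """Counter split into shards so that parallel increments rarely contend."""

    def __init__(self, num_shards=8):
        self._shards = [0] * num_shards
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def increment(self, amount=1):
        # Pick a random shard; only that shard's lock is held during the update.
        index = random.randrange(len(self._shards))
        with self._locks[index]:
            self._shards[index] += amount

    def value(self):
        # Reading is the slow path: every shard has to be visited and summed.
        total = 0
        for index, lock in enumerate(self._locks):
            with lock:
                total += self._shards[index]
        return total


def worker(counter, times):
    for _ in range(times):
        counter.increment()


counter = ShardedCounter()
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(counter.value())
```

The trade-off is typical for this family of solutions: writes get cheaper and more parallel, while reads get more expensive.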

At Zemanta we have started to use Cassandra counters only recently, so we don’t have extensive experience with them yet. But the first impression is extremely positive. We are able to process ~100 increments per second effortlessly on a very modest hardware setup. Additionally, counting is very fast, with an average increment time of ~5 ms and maximums rarely exceeding 30 ms.

Comments are not Code

(A few days back I was again referring to my old post Comments are not Code when talking with my coworker. I published that post some time ago at my former blog and to make it more accessible I’m republishing it here.)

I’m a firm believer that the best software documentation is the running code. If the code is well structured and written, it speaks for itself and it does not need any additional documentation. Comments are not code and therefore should not be used where better code organization would suffice.

A misplaced use of comments that I often see while doing code reviews is to use comments to divide a method into logical subunits. For example:

def check_specific_candidate(candidate):

  # first check if we already have X by any chance
  < 10 lines of code, return if true>

  # Try out if candidate is Y
  < 30 lines of code, return if true>

  # candidate is not Y, try out if it is Z
  < another 30 lines of code, return if true>

  # construct a list of elements in the candidate
  < another 30 lines of code>

  if len(list_of_elements) > 0:
    # process list of elements for the candidate
    < another 10 lines of code>

This example is based on an actual routine in the Zemanta code base that is altogether 140 lines long. Supporting such code is not a nice experience. While the comments in this routine do help, they are actually a symptom of a larger problem, i.e. poor code organization. The comments would immediately become redundant if this routine were split into logical steps, with each step being a separate routine. Let’s refactor the above routine accordingly:

def check_specific_candidate(candidate):

  if _candidate_has_X(candidate):
    return

  if _candidate_is_Y(candidate):
    return

  if _candidate_is_Z(candidate):
    return

  list_of_elements = _get_list_of_elements(candidate)
  if len(list_of_elements) > 0:
    _process_list_of_elements(list_of_elements)

So instead of using comments, this routine is now documented by method names. When you approach such code for the first time, seeing such a nice 15-line routine is much less stressful than seeing a 140-line monster.