RW’s Blog
DATA MINING, MACHINE LEARNING AND MORE

Analyzing Social Bookmarking Systems: A del.icio.us Cookbook

July 10th, 2008 . by rw

We recently analyzed the structure and dynamics of social bookmarking systems. For this purpose, we spent six months crawling about 142 million del.icio.us bookmarks from around 1 million users. I will give an overview of our findings here. Details can be found in this paper, which I am going to present at ECAI 2008.

The growth of del.icio.us

According to this blog, the del.icio.us bookmarking site went online in September 2003 and, as our data indicates, has seen exponential growth since then. According to our dataset, there were over 7,305,559 newly added bookmarks and 47,429 newly appearing del.icio.us users in December 2007.

There is an interesting period in the first half of 2006 in which del.icio.us did not grow much at all, but plateaued at around 3.5 million new bookmarks a month. This pattern was also reported by other authors, but the reason remains unclear.

The monthly growth of del.icio.us between 2004 and 2008 by posted bookmarks, new users, new URLs and new tags.
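
A minimal sketch of how monthly series like these could be derived from raw bookmark records. The `(user, url, posted_on)` record layout and the function name `monthly_growth` are assumptions made for illustration, not the format of our crawl. New URLs and new tags could be counted in the same way as new users.

```python
from collections import Counter
from datetime import date

def monthly_growth(bookmarks):
    """Count posted bookmarks and first-time users per (year, month).

    `bookmarks` is an iterable of (user, url, posted_on) tuples, where
    posted_on is a datetime.date. This layout is a hypothetical one.
    """
    posts = Counter()
    new_users = Counter()
    seen_users = set()
    # Walk the records chronologically so that "new" means first appearance.
    for user, url, posted_on in sorted(bookmarks, key=lambda b: b[2]):
        month = (posted_on.year, posted_on.month)
        posts[month] += 1
        if user not in seen_users:
            seen_users.add(user)
            new_users[month] += 1
    return posts, new_users

# Toy data, not taken from the corpus.
sample = [
    ("alice", "http://a.example/", date(2007, 11, 3)),
    ("bob", "http://b.example/", date(2007, 12, 1)),
    ("alice", "http://c.example/", date(2007, 12, 9)),
]
posts, new_users = monthly_growth(sample)
print(posts[(2007, 12)], new_users[(2007, 12)])  # 2 posts, 1 new user
```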

Bookmarking patterns

Unsurprisingly, we find that the del.icio.us community is biased toward content related to web communities and web technology. The most popular sites are listed in the table below.

Top 10 most frequent URLs in the corpus.

However, the above table does not fully reflect the popularity of services that use deep links for their content, such as news or content sharing portals. The next table therefore additionally lists the most popular domains within our corpus.

Top 10 most frequent domains in the corpus
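
Aggregating deep links by domain can be sketched as follows, using only the Python standard library. The helper name `top_domains` and the example URLs are hypothetical, and stripping a leading "www." is a simplifying assumption rather than the normalization we applied.

```python
from collections import Counter
from urllib.parse import urlsplit

def top_domains(urls, n=10):
    """Return the n most frequent host names among the given URLs."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname  # None if the URL has no host part
        if not host:
            continue
        # Treat "www.site.example" and "site.example" as the same domain.
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts.most_common(n)

# Toy data: two deep links into the same site count as one domain.
urls = [
    "http://www.video.example/watch?v=1",
    "http://video.example/watch?v=2",
    "http://news.example/2008/07/10/story",
    "not a url",
]
print(top_domains(urls))
```

Each deep link is a distinct URL and therefore ranks low in a per-URL table, while the domain as a whole collects all of its bookmarks.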

We find that user activity follows a power law distribution, with a few users being responsible for a high number of posts. This is illustrated by the fact that the top 1% of users contribute 22% of all bookmarks, and the top 10% contribute 62%. We assume that part of this skew is caused by spam users (see the Spam section below).

Another power law dependency can be found in the occurrence frequencies of URLs: 39% of all bookmarks link to the top 1% of URLs and 61% to the top 10%. Furthermore, we find that 80% of all URLs appear only once in the corpus. The URL distribution seems less polluted by spam, as a user can bookmark a URL only once.
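
Concentration figures of this kind (the share held by the top 1% or 10%, and the fraction of items that occur only once) can be computed from a simple frequency table. The following is a sketch with toy numbers. The function names are hypothetical, and rounding the cutoff up to at least one item is my own assumption.

```python
from collections import Counter
import math

def top_share(counts, fraction):
    """Share of all occurrences that belongs to the top `fraction` of items."""
    values = sorted(counts.values(), reverse=True)
    if not values:
        return 0.0
    # Number of items in the top group, rounded up, at least one.
    k = max(1, math.ceil(len(values) * fraction))
    return sum(values[:k]) / sum(values)

def singleton_fraction(counts):
    """Fraction of items that occur exactly once."""
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)

# Toy data: one very active user and nine users with a single post each.
posts_per_user = Counter({"bot": 91, **{f"user{i}": 1 for i in range(9)}})
print(top_share(posts_per_user, 0.10))     # share of the top 10% of users
print(singleton_fraction(posts_per_user))  # users with exactly one post
```

The same two functions apply to URL frequencies by passing a counter keyed by URL instead of by user.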

Tagging patterns

We find that each bookmark is labeled with 3.16 tags on average. The tags assigned to a bookmark can serve different functions. The authors of this paper identify seven tagging purposes, the most relevant being the assignment of tags that describe the topic and the type of the bookmarked resource. Our analysis supports these findings, as can be seen in the table below, which lists the 20 most frequent del.icio.us tags.

The vocabulary of del.icio.us users seems to be highly standardized. Even though there are around 7 million tags in our corpus, only 700 account for 50% of all assignments. This convergence is likely supported by the tag recommendation mechanisms provided by del.icio.us, which suggest tags based on the user's own or other users' previous labels. 55% of all tags were found to appear only once in the data.
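
The two vocabulary statistics above (average tags per bookmark, and how few tags cover half of all assignments) can be sketched like this. The input layout, a list of tag lists with one entry per bookmark, and the function name `tags_covering` are assumptions for illustration.

```python
from collections import Counter

def tags_covering(tag_counts, share=0.5):
    """Smallest number of most-frequent tags that reach `share` of all assignments."""
    total = sum(tag_counts.values())
    running = 0
    for rank, (_, count) in enumerate(tag_counts.most_common(), start=1):
        running += count
        if running >= share * total:
            return rank
    return 0  # only reached for an empty counter

# Toy data: one list of tags per bookmark.
bookmarks = [
    ["python", "programming"],
    ["python"],
    ["design", "css", "python"],
    ["recipes"],
]
tag_counts = Counter(tag for tags in bookmarks for tag in tags)
mean_tags = sum(len(tags) for tags in bookmarks) / len(bookmarks)

print(mean_tags)                  # average number of tags per bookmark
print(tags_covering(tag_counts))  # tags needed to cover 50% of assignments
```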

Top 20 most frequent tags in the corpus

Tendencies in the del.icio.us tag distribution strongly correlate with external events, as shown in the figure below. The figure presents the dynamics of 5 sample tags in 2007. As can be seen from the time series, the tagging trends reflect both the emergence of new technologies, such as Google’s ’Android’ announced in early November 2007, and periodic events, such as ’Christmas’. The delay between an external event and its echo on del.icio.us seems marginal.

Occurrence of 5 sample tags in 2007 as percentage of overall assignments.
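
A time series like the one in the figure could be produced along the following lines. This is a sketch only: the `(month, tag)` input pairs and the function name are hypothetical, and normalizing by the assignments of the same month is my assumption about how "percentage of overall assignments" is computed.

```python
from collections import Counter

def monthly_tag_share(assignments, tag):
    """Percentage of each month's tag assignments that use `tag`.

    `assignments` is an iterable of (month, tag) pairs, with month given
    as a sortable string such as "2007-11" (an assumed layout).
    """
    totals = Counter()
    hits = Counter()
    for month, assigned_tag in assignments:
        totals[month] += 1
        if assigned_tag == tag:
            hits[month] += 1
    return {month: 100.0 * hits[month] / totals[month] for month in sorted(totals)}

# Toy data, not taken from the corpus.
assignments = [
    ("2007-10", "python"), ("2007-10", "design"),
    ("2007-11", "android"), ("2007-11", "python"),
    ("2007-11", "android"), ("2007-11", "design"),
]
print(monthly_tag_share(assignments, "android"))
```

Normalizing per month keeps the curves comparable even though the total number of assignments grows over the year.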

Spam

An initial analysis of our corpus revealed a frequent occurrence of bookmarks that are presumably spam and were posted by automated mechanisms. As with email, spam has a severe impact on social bookmarking. An analysis of the top 20 most active del.icio.us users uncovered 19 users of apparently non-human origin, posting tens of thousands of URLs pointing to only a few domains. These 19 ’users’ alone account for 1,321,316 bookmarks – around 1% of the corpus. Unfortunately, this result comes as no surprise, since del.icio.us offers an API for remote postings and URLs that appear on del.icio.us have the potential of reaching thousands of users.

Generally, we find spammers to exhibit one or more of the following characteristics:

  • Very high activity. Automated posting routines may reach much higher participation rates than human users.
  • Few domains. The URLs posted by spam users are likely to belong to a very small set of domains. Figure 6 plots the number of posts going to a domain versus the number of users bookmarking this domain. As the figure shows, many domains receive a very high number of postings coming from only a few users.
    Figure 6: The number of bookmarks compared to the number of users linking to a domain.

  • High tagging rate. Some spam users tend to label their bookmarks with an exorbitant number of tags in order to increase their visibility.
  • Very low tagging rate. Other spam users do not seem to care about tagging at all, but constantly upload bookmarks without any tags, probably to increase the number of incoming links to their domain(s).
  • Bulk posts. Bulk uploads are a strong indicator of automated postings. However, automated postings may also come from human users, e.g. if a user synchronizes their local bookmarks with del.icio.us using existing software tools.
  • Combinations of the above. In most cases, we find a combination of the above characteristics. Figure 7, for example, shows the correlation between the number of bookmarks a user has and the average number of tags they assign to each bookmark. As can be seen from the figure, some users have very high values in both dimensions and can thus easily be identified as spammers.

Figure 7: The number of user bookmarks compared to the average number of tags assigned by the user.
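
The characteristics listed above could be turned into a simple rule-based check, sketched below. This is not the method used in our analysis: the `UserStats` fields, the function name and all thresholds are hypothetical values chosen for illustration. Bulk posts are left out because detecting them would require posting timestamps.

```python
from dataclasses import dataclass

@dataclass
class UserStats:
    bookmarks: int   # total number of bookmarks posted by the user
    domains: int     # number of distinct domains among those bookmarks
    avg_tags: float  # average number of tags per bookmark

def spam_signals(user, max_bookmarks=10_000, min_domain_ratio=0.01,
                 max_avg_tags=20.0):
    """Return the spam characteristics a user exhibits (thresholds are made up)."""
    signals = []
    if user.bookmarks > max_bookmarks:
        signals.append("very high activity")
    if user.bookmarks > 0 and user.domains / user.bookmarks < min_domain_ratio:
        signals.append("few domains")
    if user.avg_tags > max_avg_tags:
        signals.append("high tagging rate")
    if user.bookmarks > 0 and user.avg_tags == 0:
        signals.append("very low tagging rate")
    return signals

# Toy profiles, not real del.icio.us users.
bot = UserStats(bookmarks=50_000, domains=3, avg_tags=35.0)
human = UserStats(bookmarks=400, domains=180, avg_tags=3.2)
print(spam_signals(bot))
print(spam_signals(human))
```

Returning the list of matched signals rather than a single yes/no flag makes it possible to treat combinations of characteristics as stronger evidence than any one of them alone.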

Conclusions

The del.icio.us bookmarking service provides a valuable source for information retrieval and social data analysis. However, we found that spam heavily distorts any analysis. If time permits, we might investigate different machine learning approaches for automated spam detection in the future.

5 Responses to “Analyzing Social Bookmarking Systems: A del.icio.us Cookbook”

  1. comment number 1 by: Charles

    Hi
    Very interesting study and fascinating results. I was surprised by the “3.16 tags” per bookmark. Did you compare these tags to the Delicious recommender system?

  2. comment number 2 by: rw

    Hi Charles,

    we did not do any comparison with the del.icio.us tag recommender (if that’s what you meant). What exactly would you be interested in?

  3. comment number 3 by: Björn

    What are the “seven tagging purposes”? Are they enumerated in the paper by Golder and Huberman? That’s not clear to me from the text of your paper.

  4. comment number 4 by: Dan Dascalescu

    Fascinating! Thanks for the work and for sharing.

  5. comment number 5 by: rw

    Hi Björn,

    in “Usage patterns of collaborative tagging systems” Golder and Huberman identified the following functions tags perform for bookmarks:

    (1) Identifying what (or who) it is about.
    (2) Identifying what it is.
    (3) Identifying who owns it.
    (4) Refining categories.
    (5) Identifying qualities or characteristics.
    (6) Self reference.
    (7) Task organizing.

    See here for details:

    http://www3.isrl.uiuc.edu/~junwang4/langev/localcopy/pdf/golder05taggingSystems.pdf

    From my experience, I would like to add that many people tend to apply tags for later retrieval which results in certain consistent tagging patterns per user.

    Furthermore, I believe that the tag suggestions which appear when posting a new bookmark strongly influence the tagging behavior. But I am not aware of any study on this topic.
