Text mining could save Digg

Many sites have already reported that the popular news site Digg is overwhelmingly controlled by a very small group of users. Furthermore, some users predict that unless Digg can again become a true interactive community, the site is done for: it will become repetitive and untrustworthy, offering neither the voice of the masses nor properly edited, authentic journalistic content.

Many of these complaints have one thing in common: they point to the large number of front-page (highly “dugg”) articles which are, in fact, duplicates of articles dugg earlier by less well-connected users.

And so, I propose that what Digg could do to “save” itself (as though the wildly popular site truly needs a saviour) is to reduce duplication through textual-analysis data mining. Or, less technically, by helping users find already-dugg articles related to the one they’re digging or reading.

Side note: What I mean by textual analysis is the practice of scanning ordinary text and looking for patterns and similarities with other text. This can be done automatically, by a variety of statistical and data mining packages. An obvious use of this sort of tool is contextual ads, as served up by Google AdWords, although you can be far more precise than those simple ads if you put the effort into it.
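
To make that a little more concrete, here’s a minimal sketch of the core idea: measure how similar two pieces of text are by comparing their word counts. This isn’t anyone’s actual implementation, and the function names and headlines are invented; it’s just to show that the basic machinery is simple.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def cosine_similarity(text_a, text_b):
    """Score two texts from 0.0 (nothing in common) to 1.0 (identical word counts)."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two made-up headlines for the same story, worded differently.
print(cosine_similarity(
    "Apple unveils new iPod with video playback",
    "New video-capable iPod unveiled by Apple",
))  # roughly 0.57; two unrelated headlines would score near 0.0
```

Real text-mining packages go well beyond raw word counts (weighting rare words more heavily, stemming, and so on), but the principle is the same.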

You see, Digg already checks for duplicate articles, but it restricts itself to checking whether a submission contains exactly the same link as an earlier article. This works fine for, say, blog posts, where there is one authoritative source for the article, but it falls down when there are multiple sources.

Multiple sources are common with wire stories (an Associated Press or Reuters article will show up, essentially word for word, on tons of different news sites) and with breaking stories covered by many outlets. Additionally, many sites reference primary sources, and a Digger may decide that the more nuanced referencing article is a better digg than the primary source itself.

In any of these situations, Digg’s automated processes won’t find the original source.
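
As a toy illustration (the URLs, titles, and helper names here are invented), here’s the difference between an exact-link check and even a crude word-overlap check when the same Associated Press story is submitted from two different outlets:

```python
import re

def words(text):
    """Lowercase word tokens; rough, but good enough for comparing headlines."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def overlap(a, b):
    """Share of words two headlines have in common, from 0.0 to 1.0."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb) if wa or wb else 0.0

existing = {
    "url": "http://news.example-one.com/2006/08/ap-digg-story.html",
    "title": "AP: Social news site Digg dominated by a handful of top users",
}
submission = {
    "url": "http://www.example-two.com/tech/digg-top-users-ap.html",
    "title": "Social news site Digg dominated by handful of top users (AP)",
}

# A Digg-style exact-link check misses this duplicate entirely.
print(submission["url"] == existing["url"])             # False

# Even a crude word-overlap check sees these are the same story.
print(overlap(existing["title"], submission["title"]))  # about 0.92
```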

Of course, Digg encourages its users to search before posting, but Digg’s database is huge, and a simple keyword search often fails to turn up the same post. (And, in some cases, the search won’t surface a post even if you type in exactly the same title, because the story has been buried or otherwise hidden by some behind-the-scenes process.)

So instead, Digg would be well served to use text-mining tools to classify its articles and relate them to one another. Then, upon digging a link, a user would be faced with not only the “This link is a duplicate” message, but also “These Digg entries are similar to what you posted; please check them and make sure you aren’t duplicating an existing post.” This could be done with far more precision than a search, and it would save Diggers a step, encouraging them to post their articles while still being circumspect about whether they’re adding duplicates.
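
Here is one way that submission flow could look. Again, this is only a hedged sketch with invented stories and cutoffs: score the new submission against every existing story, flag anything above a “probable duplicate” threshold, and offer the next-best matches as related reading.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Bag-of-words counts; a real system would also strip stop words and stem."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def suggest(new_text, stories, dup_cutoff=0.6, related_cutoff=0.2):
    """Split existing stories into probable duplicates and related reading."""
    new_vec = word_counts(new_text)
    scored = sorted(((cosine(new_vec, word_counts(s)), s) for s in stories),
                    reverse=True)
    dups = [s for score, s in scored if score >= dup_cutoff]
    related = [s for score, s in scored if related_cutoff <= score < dup_cutoff]
    return dups, related

# Invented stand-ins for already-dugg stories.
already_dugg = [
    "Digg front page controlled by a small group of top users",
    "How Digg's algorithm decides which stories hit the front page",
    "Reuters: oil prices fall for third straight day",
]

dups, related = suggest(
    "Small group of top users controls the Digg front page", already_dugg)
print(dups)     # the first story, flagged as a probable duplicate
print(related)  # the second story, offered as additional reading
```

The cutoffs would need tuning against real data, and at Digg’s scale you’d want an index rather than a scan over every story, but the user-facing result is the pair of prompts described above.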

These related articles could even appear in a sidebar alongside each story, helping the masses of Digg readers bury duplicate articles while also providing “additional reading” on related topics. Users clicking on these related articles would increase Digg’s page views (and profitability), and would also potentially promote articles with few diggs that are related to topics on the front page.

I realize that this would be a substantial investment, but it would greatly increase the value of Digg and make it more open to less popular Diggers. And none of this is rocket science; dozens of sites use these techniques to help their readers find the content they need. In Digg’s case, it could also be used to maintain the highest possible editorial quality.


Written on August 23, 2006