I've been looking into semantic web services to extract key terms and concepts from user-generated content. Calais and Zemanta both offer rich web services, designed to help you find and integrate relevant and related content from around the web. For my purposes, I'm just interested in the term/concept extraction - which is just a small part of what they provide. Yahoo! has a much more basic service designed to do just that, appropriately named the Yahoo Term Extraction Service.
I decided to do a quick evaluation/comparison, using the following text from one of my delicious bookmarks:
Online Communities: Establishing a Community's Culture - Online Community Report
We initiated the Online Community Culture study in October of 2008, as part of the ongoing research agenda of the Online Community Research Network. The intention of the study was to get a broad look at the factors that influence online community culture, and the steps community managers and strategists take in cultivating, and in some cases influencing, a community’s culture. We had over 75 participants in the research, representing many sectors, including software, tech, traditional media, social media and online community, and non-profits. Respondents seniority skewed towards Manager (44%), Directors & VP's (12%).
The results from each were quite different. Calais and Zemanta both seem to have more "semantic intelligence" and were able to focus on the terms most relevant to the subject. Calais returned a short list of terms, all relevant and all extracted directly from the text. Zemanta returned a broader set, including some related terms not explicitly in the text, such as "social network" and "community management". Unfortunately, it also included some unhelpful terms, such as "computers" and "on the web". Yahoo! provided the broadest list of terms, but also the least helpful. With all of its terms extracted directly from the text, Yahoo!'s service seems to be mostly a parser, with the least semantic analysis of the three.
However, Yahoo!'s simplicity can be valuable as well. With other examples, I've seen Calais and Zemanta come up empty (no terms), while Yahoo! returned some relevant and some not-so-relevant terms. As with people, too much intelligence can be problematic. ;-)
Unfortunately, none of the services consistently provides ideal terms. But combined, you might get decent results. That's something I'm continuing to explore.
For those interested, the resulting terms from each service are below.
Calais:
Zemanta:
Yahoo!:
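On the idea of combining the services: one simple approach is to keep only the terms that more than one service agrees on. The sketch below is just an illustration of that voting idea, not something any of the three services provides. The function names, the min_votes threshold, and the sample term lists are all made up for the example, and the lists are not real output from Calais, Zemanta, or Yahoo!.

```python
from collections import Counter


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so 'Social  Media' matches 'social media'."""
    return " ".join(term.lower().split())


def combine_terms(results: dict[str, list[str]], min_votes: int = 2) -> list[str]:
    """Return terms proposed by at least `min_votes` services, most votes first."""
    votes = Counter()
    for terms in results.values():
        # A set ensures each service votes at most once per term.
        votes.update({normalize(t) for t in terms})
    kept = [(term, n) for term, n in votes.items() if n >= min_votes]
    # Sort by vote count (descending), then alphabetically for stable output.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in kept]


# Hypothetical term lists for illustration only, not real service output.
sample = {
    "service_a": ["Online Community", "Social Media"],
    "service_b": ["online community", "social media", "computers"],
    "service_c": ["online  community", "research", "managers"],
}

print(combine_terms(sample))
```

Requiring two votes filters out one-off terms like "computers", at the cost of dropping anything only a single service finds. For short "micro content", where some services may return nothing at all, lowering min_votes to 1 would fall back to a plain union of the lists.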
Tags: semantic-web
Tools / Services
2/12/2009 7:12:18 PM #
Hi, Andraz from Zemanta here. I am glad to see tests like this showing that natural language understanding tools can be easily leveraged by developers! Just want to mention that if you only want to discover important terms/concepts found inside the text, then you should take a look at the "markup" part of the response. We try to 'go broad' with tags, since that's what most people expect from them. I'll be interested to see what you come up with next. And we are always looking for feedback! Have a nice day. Andraz Tori, CTO at Zemanta
Andraz Tori
2/13/2009 3:01:24 AM #
Troy: Tom Tague from Calais here. First - thanks for putting this together. We really appreciate it when someone actually tries the tools and reports on results. It's the only way we'll get better. I'd like to encourage you to beat up on the tools with some more challenging text: something that contains people, organizations, companies, geographies and possibly even events like management changes or natural disasters. I believe it's also important to distinguish between "conceptual tagging" and "semantic tagging". With semantic tagging we're trying to get at named entities, facts and events that are hidden within the text. With conceptual tagging we would, essentially, like to emulate what a smart person might tag the text as being "about". Both are valid in their own contexts. Conceptual tagging is good for enhancing search and possibly navigation; semantic tagging is a bit more rigorous and can serve to tie your content to the linked data world, improve filtering and serve other purposes. In an upcoming release Calais will add conceptual tagging to our toolkit - so, for example, a story about Ferrari and Porsche might be concept tagged "sports cars" and "automotive" - rather than just Ferrari and Porsche. Thanks for the work and please keep us up to date as it progresses. Regards,
Tom Tague
2/13/2009 3:02:53 AM #
Thanks for the tip Andraz - and for providing the service! It is great to see companies like yours leading us into the semantic web frontier.
Troy Sabin
2/13/2009 3:20:38 AM #
Thanks Tom. You certainly have a very rich service - I know I just scratched the surface. In my case, conceptual tagging is really what I'm looking for. I'm glad to hear you'll be adding that to the toolkit. I chose a small text sample because it represents a good portion of the content I'll be processing. I expect the more text (and context) you have, the better your results. Full blog posts may provide that, but I'll also be processing Twitter updates, Facebook messages, and other "micro content". So I was interested to see how you would handle that. I appreciate you pushing the semantic web envelope and providing a great service. Have you announced a date or timeframe for the upcoming release?
2/13/2009 6:33:02 AM #
Troy: Certainly within the next two months - shooting for sooner. We're working hard to make certain the signal:noise ratio for Concept tags is very high. Last thing we need is a tool to generate 40 unweighted concepts per story. As far as tweets - there are a number of people looking into / experimenting with tagging not so much the tweets (not a lot of context there) - but the links within the tweets. In fact - if anything is shortened via the bit.ly service it is already automatically processed by Calais and available via the bit.ly API. You can see an example of the metadata bit.ly exposes here: http://bit.ly/info/FR9T8. Just a small subset of what's available - but a start. Regards,
2/13/2009 6:45:59 AM #
Great, thanks again Tom. If you have an ongoing or upcoming beta program, I'd be interested in participating. Otherwise, I'll look forward to the public release.
2/16/2009 10:31:29 AM #
Another semantic content extraction / natural language processing API worth checking out is AlchemyAPI. It offers named entity extraction, document categorization, keyword extraction, microformats parsing, and SDKs in a half dozen languages (Java, PHP, Python, Perl, C#, etc.). An interactive demo is available here: http://www.orch8.net/api/demo.html. If you test it out, we'd love any feedback you have on the service.
Elliot
2/16/2009 11:23:23 AM #
Thanks for the suggestion to check out Alchemy. Looks like another good option. For those following, the demo page provided the following terms for the same sample text: Online Community Research Network, Online Community Culture, social media, research, managers, media, community.
2/21/2009 4:20:46 AM #
Tom, I got the Calais R4 beta notice today. Congrats on getting the public beta out. Great to see the Linked Data Cloud support. I don't see anything in the release notes regarding conceptual tagging. Did that make this release?
I'm an Internet technology business strategist, software architect, and development leader specializing in interactive marketing and social media.