What Is RSS

王朝 (Wangchao) java/jsp · Author: anonymous · 2006-01-08

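RSS (Rich Site Summary or RDF Site Summary) is a technology for syndicating website content. It grew out of the browser "news channel" feature, but it has since found much wider use in areas such as enterprise portals and enterprise application integration (EAI).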

————————————————

What is RSS?

By Mark Pilgrim

RSS is a format for syndicating news and the content of news-like sites, including major news sites like Wired, news-oriented community sites like Slashdot, and personal weblogs. But it's not just for news. Pretty much anything that can be broken down into discrete items can be syndicated via RSS: the "recent changes" page of a wiki, a changelog of CVS checkins, even the revision history of a book. Once information about each item is in RSS format, an RSS-aware program can check the feed for changes and react to the changes in an appropriate way.

RSS-aware programs called news aggregators are popular in the weblogging community. Many weblogs make content available in RSS. A news aggregator can help you keep up with all your favorite weblogs by checking their RSS feeds and displaying new items from each of them.

A brief history

But coders beware. The name "RSS" is an umbrella term for a format that spans several different versions of at least two different (but parallel) formats. The original RSS, version 0.90, was designed by Netscape as a format for building portals of headlines to mainstream news sites. It was deemed overly complex for its goals; a simpler version, 0.91, was proposed and subsequently dropped when Netscape lost interest in the portal-making business. But 0.91 was picked up by another vendor, UserLand Software, which intended to use it as the basis of its weblogging products and other web-based writing software.

In the meantime, a third, non-commercial group split off and designed a new format based on what they perceived as the original guiding principles of RSS 0.90 (before it got simplified into 0.91). This format, which is based on RDF, is called RSS 1.0. But UserLand was not involved in designing this new format, and, as an advocate of simplifying 0.90, it was not happy when RSS 1.0 was announced. Instead of accepting RSS 1.0, UserLand continued to evolve the 0.9x branch, through versions 0.92, 0.93, 0.94, and finally 2.0.

What a mess.

So which one do I use?

That's 7 -- count 'em, 7! -- different formats, all called "RSS". As a coder of RSS-aware programs, you'll need to be liberal enough to handle all the variations. But as a content producer who wants to make your content available via syndication, which format should you choose?

RSS versions and recommendations

| Version | Owner | Pros | Status | Recommendation |
|---|---|---|---|---|
| 0.90 | Netscape | | Obsoleted by 1.0 | Don't use |
| 0.91 | UserLand | Drop dead simple | Officially obsoleted by 2.0, but still quite popular | Use for basic syndication. Easy migration path to 2.0 if you need more flexibility |
| 0.92, 0.93, 0.94 | UserLand | Allows richer metadata than 0.91 | Obsoleted by 2.0 | Use 2.0 instead |
| 1.0 | RSS-DEV Working Group | RDF-based, extensibility via modules, not controlled by a single vendor | Stable core, active module development | Use for RDF-based applications or if you need advanced RDF-specific modules |
| 2.0 | UserLand | Extensibility via modules, easy migration path from 0.9x branch | Stable core, active module development | Use for general-purpose, metadata-rich syndication |

What does RSS look like?

Imagine you want to write a program that reads RSS feeds, so that you can publish headlines on your site, build your own portal or homegrown news aggregator, or whatever. What does an RSS feed look like? That depends on which version of RSS you're talking about. Here's a sample RSS 0.91 feed (adapted from XML.com's RSS feed):

<rss version="0.91">
  <channel>
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <item>
      <title>Normalizing XML, Part 2</title>
      <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
      <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
    </item>
    <item>
      <title>The .NET Schema Object Model</title>
      <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
      <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
    </item>
    <item>
      <title>SVG's Past and Promising Future</title>
      <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
      <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
    </item>
  </channel>
</rss>

Simple, right? A feed comprises a channel, which has a title, link, description, and (optional) language, followed by a series of items, each of which has a title, link, and description.

Now look at the RSS 1.0 version of the same information:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
>
  <channel rdf:about="http://www.xml.com/cs/xml/query/q/19">
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/normalizing.html"/>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/som.html"/>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/svg.html"/>
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/normalizing.html">
    <title>Normalizing XML, Part 2</title>
    <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
    <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
    <dc:creator>Will Provost</dc:creator>
    <dc:date>2002-12-04</dc:date>
  </item>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/som.html">
    <title>The .NET Schema Object Model</title>
    <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
    <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
    <dc:creator>Priya Lakshminarayanan</dc:creator>
    <dc:date>2002-12-04</dc:date>
  </item>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/svg.html">
    <title>SVG's Past and Promising Future</title>
    <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
    <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
    <dc:creator>Antoine Quint</dc:creator>
    <dc:date>2002-12-04</dc:date>
  </item>
</rdf:RDF>

Quite a bit more verbose. People familiar with RDF will recognize this as an XML serialization of an RDF document; the rest of the world will at least recognize that we're syndicating essentially the same information. In fact, we're including a bit more information: item-level authors and publishing dates, which RSS 0.91 does not support.

Despite being RDF/XML, RSS 1.0 is structurally similar to previous versions of RSS -- similar enough that we can simply treat it as XML and write a single function to extract information out of either an RSS 0.91 or RSS 1.0 feed. However, there are some significant differences that our code will need to be aware of:

1. The root element is rdf:RDF instead of rss. We'll either need to handle both explicitly or just ignore the name of the root element altogether and blindly look for useful information inside it.

2. RSS 1.0 uses namespaces extensively. The RSS 1.0 namespace is http://purl.org/rss/1.0/, and it's defined as the default namespace. The feed also uses http://www.w3.org/1999/02/22-rdf-syntax-ns# for the RDF-specific elements (which we'll simply be ignoring for our purposes) and http://purl.org/dc/elements/1.1/ (Dublin Core) for the additional metadata of article authors and publishing dates.

   We can go in one of two ways here. If we don't have a namespace-aware XML parser, we can blindly assume that the feed uses the standard prefixes and default namespace and look for item elements and dc:creator elements within them. This will actually work in a large number of real-world cases; most RSS feeds use the default namespace and the same prefixes for common modules like Dublin Core. This is a horrible hack, though. There's no guarantee that a feed won't use a different prefix for a namespace (which would be perfectly valid XML and RDF). If or when it does, we'll miss it.

   If we have a namespace-aware XML parser at our disposal, we can construct a more elegant solution that handles both RSS 0.91 and 1.0 feeds. We can look for items in no namespace; if that fails, we can look for items in the RSS 1.0 namespace. (Not shown: RSS 0.90 feeds also use a namespace, though not the same one as RSS 1.0. So what we really need is a list of namespaces to search.)

3. Less obvious but still important, the item elements are outside the channel element. (In RSS 0.91, the item elements were inside the channel. In RSS 0.90, they were outside; in RSS 2.0, they're inside. Whee.) So we can't be picky about where we look for items.

4. Finally, you'll notice there is an extra items element within the channel. It's only useful to RDF parsers, and we're going to ignore it and assume that the order of the items within the RSS feed is given by the order of the item elements.
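
To make the difference between the prefix hack and the namespace-aware approach concrete, here is a minimal sketch. The feed fragment is hypothetical: it binds Dublin Core to the prefix "dublin" rather than the usual "dc", which is legal XML. The names DC_NS, SAMPLE, byPrefix, and byNamespace are ours, chosen for illustration.

```python
from xml.dom import minidom

DC_NS = 'http://purl.org/dc/elements/1.1/'

# Hypothetical feed fragment: valid XML, but Dublin Core is bound
# to the prefix "dublin" instead of the customary "dc".
SAMPLE = """<rss version="2.0" xmlns:dublin="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Example item</title>
      <dublin:creator>Jane Doe</dublin:creator>
    </item>
  </channel>
</rss>"""

doc = minidom.parseString(SAMPLE)

# Prefix hack: matches the literal qualified name "dc:creator" only.
byPrefix = doc.getElementsByTagName('dc:creator')

# Namespace-aware lookup: matches namespace URI plus local name,
# regardless of which prefix the feed chose.
byNamespace = doc.getElementsByTagNameNS(DC_NS, 'creator')

print(len(byPrefix))
print(len(byNamespace))
print(byNamespace[0].firstChild.data)
```

The prefix lookup comes back empty on this feed, while the namespace-aware lookup still finds the author.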

But what about RSS 2.0? Luckily, once we've written code to handle RSS 0.91 and 1.0, RSS 2.0 is a piece of cake. Here's the RSS 2.0 version of the same feed:

<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <item>
      <title>Normalizing XML, Part 2</title>
      <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
      <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
      <dc:creator>Will Provost</dc:creator>
      <dc:date>2002-12-04</dc:date>
    </item>
    <item>
      <title>The .NET Schema Object Model</title>
      <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
      <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
      <dc:creator>Priya Lakshminarayanan</dc:creator>
      <dc:date>2002-12-04</dc:date>
    </item>
    <item>
      <title>SVG's Past and Promising Future</title>
      <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
      <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
      <dc:creator>Antoine Quint</dc:creator>
      <dc:date>2002-12-04</dc:date>
    </item>
  </channel>
</rss>

As this example shows, RSS 2.0 uses namespaces like RSS 1.0, but it's not RDF. Like RSS 0.91, there is no default namespace and items are back inside the channel. If our code is liberal enough to handle the differences between RSS 0.91 and 1.0, RSS 2.0 should not present any additional wrinkles.
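
We chose to ignore the root element, but if you would rather handle the formats explicitly, something like the following sketch could tell them apart. It assumes a namespace-aware parser (minidom here); sniffFormat and the constant names are hypothetical, and the labels it returns are only rough hints, not a validation of the feed.

```python
from xml.dom import minidom

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
RSS10_NS = 'http://purl.org/rss/1.0/'
RSS090_NS = 'http://my.netscape.com/rdf/simple/0.9/'

def sniffFormat(document):
    """Return a rough label for the feed format, or None if unrecognized."""
    root = document.documentElement
    if root.namespaceURI is None and root.localName == 'rss':
        # The 0.91 through 2.0 branch declares its version in an attribute
        return 'RSS ' + (root.getAttribute('version') or '0.9x/2.0')
    if root.namespaceURI == RDF_NS and root.localName == 'RDF':
        # The RDF-based formats differ in the namespace of <channel>
        if document.getElementsByTagNameNS(RSS10_NS, 'channel'):
            return 'RSS 1.0'
        if document.getElementsByTagNameNS(RSS090_NS, 'channel'):
            return 'RSS 0.90'
    return None

# Hypothetical minimal documents, just enough to exercise each branch
old = minidom.parseString('<rss version="0.91"><channel/></rss>')
rdf = minidom.parseString(
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
    ' xmlns="http://purl.org/rss/1.0/"><channel/></rdf:RDF>')

print(sniffFormat(old))
print(sniffFormat(rdf))
```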

How can I read RSS?

Now let's get down to actually reading these sample RSS feeds from Python. The first thing we'll need to do is download some RSS feeds. This is simple in Python; most distributions come with both a URL retrieval library and an XML parser. (Note to Mac OS X 10.2 users: your copy of Python does not come with an XML parser; you will need to install PyXML first.)

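from xml.dom import minidom
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2

def load(rssURL):
    # Download the feed and parse it into a DOM document
    return minidom.parse(urlopen(rssURL))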

This takes the URL of an RSS feed and returns a parsed representation of the DOM, as native Python objects.
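
To get a feel for what that parsed representation looks like, here is a small sketch that parses an inline feed with minidom.parseString instead of fetching a URL, so it needs no network access. The one-item FEED string is a hypothetical stand-in for a real feed.

```python
from xml.dom import minidom

# Hypothetical one-item feed, inlined so the sketch runs offline
FEED = """<rss version="0.91">
  <channel>
    <title>Example channel</title>
    <item><title>Example item</title></item>
  </channel>
</rss>"""

document = minidom.parseString(FEED)
root = document.documentElement                 # the <rss> element

print(root.tagName)                             # name of the root element
print(root.getAttribute('version'))             # its version attribute
print(len(root.getElementsByTagName('item')))   # number of <item> descendants
```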

The next bit is the tricky part. To compensate for the differences in RSS formats, we'll need a function that searches for specific elements in any number of namespaces. Python's XML library includes a getElementsByTagNameNS method, which takes a namespace and a tag name, so we'll use that to make our code general enough to handle RSS 0.91 through 0.94 and 2.0 (which have no default namespace), RSS 1.0, and even RSS 0.90. Our function will find all elements with a given name, anywhere within a node. That's a good thing; it means that we can search for item elements within the root node and always find them, whether they are inside or outside the channel element.

DEFAULT_NAMESPACES = (
    None,                                     # RSS 0.91, 0.92, 0.93, 0.94, 2.0
    'http://purl.org/rss/1.0/',               # RSS 1.0
    'http://my.netscape.com/rdf/simple/0.9/'  # RSS 0.90
)

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    for namespace in possibleNamespaces:
        children = node.getElementsByTagNameNS(namespace, tagName)
        if len(children): return children
    return []
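
As a quick sanity check, this sketch runs the same lookup against two hypothetical minimal feeds: one with items inside the channel and no namespace, one with the item outside the channel in the RSS 1.0 namespace. The function is repeated so the sketch runs on its own; RSS_091, RSS_10, and counts are illustrative names.

```python
from xml.dom import minidom

DEFAULT_NAMESPACES = (None, 'http://purl.org/rss/1.0/',
                      'http://my.netscape.com/rdf/simple/0.9/')

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    # Same logic as the function above, repeated so this sketch runs alone
    for namespace in possibleNamespaces:
        children = node.getElementsByTagNameNS(namespace, tagName)
        if len(children): return children
    return []

# Hypothetical feed: items inside the channel, no namespace
RSS_091 = '<rss version="0.91"><channel><item/><item/></channel></rss>'
# Hypothetical feed: item outside the channel, RSS 1.0 default namespace
RSS_10 = ('<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
          ' xmlns="http://purl.org/rss/1.0/"><channel/><item/></rdf:RDF>')

counts = []
for feed in (RSS_091, RSS_10):
    document = minidom.parseString(feed)
    counts.append(len(getElementsByTagName(document, 'item')))
print(counts)
```

It should report two items for the first feed and one for the second, without the caller ever saying which format it is dealing with.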

Finally, we need two utility functions to make our lives easier. First, our getElementsByTagName function will return a list of elements, but most of the time we know there's only going to be one. An item only has one title, one link, one description, and so on. We'll define a first function that returns the first element of a given name (again, searching across several different namespaces). Second, Python's XML libraries are great at parsing an XML document into nodes, but not that helpful at putting the data back together again. We'll define a textOf function that returns the entire text of a particular XML element.

def first(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    children = getElementsByTagName(node, tagName, possibleNamespaces)
    return len(children) and children[0] or None

def textOf(node):
    return node and "".join([child.data for child in node.childNodes]) or ""
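
One caveat: textOf assumes every child of the element is a text-like node with a data attribute. If a feed embeds markup as child elements, a more tolerant variant might look like the sketch below. The name textOfDeep is hypothetical and not part of the original script; treat it as one possible approach rather than the way to do it.

```python
from xml.dom import minidom

def textOfDeep(node):
    """Hypothetical variant of textOf that also descends into child elements."""
    if node is None:
        return ""
    parts = []
    for child in node.childNodes:
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE):
            # Plain text and CDATA sections both carry their text in .data
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            # Nested markup: collect the text inside it recursively
            parts.append(textOfDeep(child))
    return "".join(parts)

doc = minidom.parseString(
    '<description>Plain, <b>bold</b>, and <![CDATA[<i>escaped</i>]]> text</description>')
result = textOfDeep(doc.documentElement)
print(result)
```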

That's it. The actual parsing is easy. We'll take a URL on the command line, download it, parse it, get the list of items, and then get some useful information from each item:

DUBLIN_CORE = ('http://purl.org/dc/elements/1.1/',)

if __name__ == '__main__':
    import sys
    rssDocument = load(sys.argv[1])
    for item in getElementsByTagName(rssDocument, 'item'):
        # String concatenation keeps the output the same under Python 2 and 3
        print('title: ' + textOf(first(item, 'title')))
        print('link: ' + textOf(first(item, 'link')))
        print('description: ' + textOf(first(item, 'description')))
        print('date: ' + textOf(first(item, 'date', DUBLIN_CORE)))
        print('author: ' + textOf(first(item, 'creator', DUBLIN_CORE)))
        print('')

Running it with our sample RSS 0.91 feed prints only title, link, and description (since the feed didn't include any other information on dates or authors):

$ python rss1.py http://www.xml.com/2002/12/18/examples/rss091.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date:
author:

title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
date:
author:

title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
date:
author:

For both the sample RSS 1.0 feed and sample RSS 2.0 feed, we also get dates and authors for each item. We reuse our custom getElementsByTagName function, but pass in the Dublin Core namespace and appropriate tag name. We could reuse this same function to extract information from any of the basic RSS modules. (There are a few advanced modules specific to RSS 1.0 that would require a full RDF parser, but they are not widely deployed in public RSS feeds.)
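
For instance, the same namespace-plus-tag-name lookup works for the RSS 1.0 Syndication module. The sketch below is self-contained, so it uses a compact stand-in (the hypothetical firstText) for the textOf(first(...)) combination, and the feed is a made-up example that declares the module under the prefix sy.

```python
from xml.dom import minidom

# Namespace of the RSS 1.0 Syndication module
SYNDICATION = ('http://purl.org/rss/1.0/modules/syndication/',)

def firstText(node, tagName, namespaces):
    """Compact stand-in for textOf(first(...)) so the sketch runs alone."""
    for namespace in namespaces:
        found = node.getElementsByTagNameNS(namespace, tagName)
        if found:
            return "".join(child.data for child in found[0].childNodes)
    return ""

# Hypothetical feed that uses the Syndication module
FEED = """<rss version="2.0"
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/">
  <channel>
    <title>Example channel</title>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>2</sy:updateFrequency>
  </channel>
</rss>"""

document = minidom.parseString(FEED)
print('period:', firstText(document, 'updatePeriod', SYNDICATION))
print('frequency:', firstText(document, 'updateFrequency', SYNDICATION))
```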

Here's the output against our sample RSS 1.0 feed:

$ python rss1.py http://www.xml.com/2002/12/18/examples/rss10.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date: 2002-12-04
author: Will Provost

title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
date: 2002-12-04
author: Priya Lakshminarayanan

title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
date: 2002-12-04
author: Antoine Quint

Running against our sample RSS 2.0 feed produces the same results.

This technique will handle about 90% of the RSS feeds out there; the rest are ill-formed in a variety of interesting ways, mostly caused by non-XML-aware publishing tools building feeds out of templates and not respecting basic XML well-formedness rules. Next month we'll tackle the thorny problem of how to handle RSS feeds that are almost, but not quite, well-formed XML.
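
Until then, it helps to know what failure looks like. As a minimal sketch, minidom raises an ExpatError when the input is not well-formed, so a caller can at least detect the problem and skip the feed. The helper name parseOrNone and the two feed fragments are hypothetical; the second fragment contains an unescaped ampersand, a typical template mistake.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def parseOrNone(text):
    """Return a parsed DOM, or None when the text is not well-formed XML."""
    try:
        return minidom.parseString(text)
    except ExpatError:
        return None

# Hypothetical feed fragments; BAD has an unescaped ampersand
GOOD = '<rss version="0.91"><channel><title>Q &amp; A</title></channel></rss>'
BAD = '<rss version="0.91"><channel><title>Q & A</title></channel></rss>'

print(parseOrNone(GOOD) is not None)
print(parseOrNone(BAD) is None)
```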

Related resources

Sample RSS feeds: RSS 0.91, RSS 1.0, RSS 2.0.

rss1.py

Specifications: RSS 0.90, RSS 0.91, RSS 1.0, RSS 2.0.

Syndic8.com, a directory of 10,000 publicly available RSS feeds.

News Readers in the Open Directory, a variety of client-side and server-side programs for reading RSS feeds.

 
 
 