RSS updated
I’ve been working on an update to my homegrown RSS aggregator for some time now. It’s been “nearly” ready for a while, but it took me time to convince myself it was actually ready. It’s funny how quickly I become conservative about pushing out updates to a tool that’s in use, even if I’m the only one using it.
I finally shut off the old version and am using the new code exclusively, after running the two side by side for a couple of days and seeing that the new code does roughly what I expect. It’s not a huge change. The basic idea is the same as the thing I deployed last October.
The fetcher I deployed last October ran as three processes:
- rssimap: Listened to a Spread channel for new articles. Appended them to an IMAP folder.
- rsssuck: Ran out of cron to fetch and process feeds. Broadcast new and updated articles to a Spread channel.
- rsslog: Listened to a Spread channel for log messages. Wrote them to a file. Both of the other two pieces logged to this Spread channel.
The new one:
- is a single process run from cron. The one process fetches, processes, and appends new articles to IMAP folders.
- has some rules for appending to different folders based on the feed category, so all the photo blogs can go into a separate folder
- includes more headers in the appended IMAP messages, including a Date header (oops) and some X-* extension headers to make it possible to map a folder message back to the article in the database
- has several bug fixes in the feed handling and uses the latest feedparser release
- uses a (slightly) adaptive scheduler. The old fetcher tried to update every feed every time it was run. The new one adjusts the update period based on how often it sees a feed getting updated.
- is slightly clever about detecting when an article is a dupe: it uses an article hash, the unique link, and (this is the new bit) the feed’s article “guid”, if available, to determine if it’s seen an article before. That caught all of the remaining cases where the old code misread an updated/changed article as a new one.
- uses apsw instead of pysqlite2; the segfault incident convinced me apsw was more robust.
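To give a flavor of the dupe check, here is a minimal sketch. It is an illustration, not the real code: the dict keys ("title", "body", "link", "guid"), the SeenArticles class, and the in-memory set standing in for the database are all assumptions made for the example. An article whose hash has been seen is a duplicate; one whose hash is new but whose link or guid is known is treated as an update rather than a new article.

```python
import hashlib


def article_keys(entry):
    """Build lookup keys for one article (a plain dict in this sketch)."""
    text = entry.get("title", "") + "\n" + entry.get("body", "")
    # The content hash always comes first; link and guid are optional.
    keys = [("hash", hashlib.sha1(text.encode("utf-8")).hexdigest())]
    if entry.get("link"):
        keys.append(("link", entry["link"]))
    if entry.get("guid"):
        keys.append(("guid", entry["guid"]))
    return keys


class SeenArticles:
    """Hypothetical in-memory stand-in for the article table."""

    def __init__(self):
        self._seen = set()

    def classify(self, entry):
        """Return 'new', 'updated', or 'duplicate' and remember the article."""
        keys = article_keys(entry)
        if keys[0] in self._seen:
            status = "duplicate"  # identical content already stored
        elif any(key in self._seen for key in keys[1:]):
            status = "updated"    # same link or guid, different content
        else:
            status = "new"
        self._seen.update(keys)
        return status
```

A real version would probably scope the guid to its feed, since two feeds can reuse the same guid value; the sketch leaves that out to stay short.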
I’m pleased. Not yet pleased enough to publish all this experimental code for the world to see, though.
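For the curious, the adaptive scheduling idea can be sketched in a few lines. This is a rough illustration under assumed numbers, not the real scheduler: the 30-minute floor, the one-day ceiling, the halve/grow-by-half rule, and the function names are all made up for the example. Each cron run would fetch only the feeds that are due, then adjust each feed's period depending on whether the fetch turned up anything new.

```python
from datetime import datetime, timedelta

# Assumed bounds; the real values are not given in the post.
MIN_PERIOD = timedelta(minutes=30)
MAX_PERIOD = timedelta(days=1)


def is_due(last_fetched, period, now):
    """True if at least one full period has passed since the last fetch."""
    return now - last_fetched >= period


def next_period(current, had_new_articles):
    """Poll busy feeds more often and quiet feeds less often.

    Halve the period after a fetch that found new or changed articles,
    grow it by 50% otherwise, and clamp the result to the bounds above.
    """
    if had_new_articles:
        proposed = current / 2
    else:
        proposed = current * 1.5
    return max(MIN_PERIOD, min(MAX_PERIOD, proposed))
```

With these numbers, a feed polled every two hours drops to hourly after one productive fetch and backs off to three hours after an empty one.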
June 21st, 2006 at 12:47 pm
The web site is a playground so please ignore!
I found your blog while researching some stuff on RSS and SQLite.
I have a project in which I am going to use these to connect up some desktop apps, web services, etc.
I saw you have been working on some stuff for some time and wanted to touch base and also ask if you have played around with
http://www.rssbus.com/
I am trying to find ways to use a mail server and RSS bus and SQLite to work with some AJAX interfaces to create some new apps.
Would like to know more of what you are up to and maybe we can talk more.
Regards
William