RSS to IMAP, the proof of concept
A few months ago I started using Thunderbird’s built-in RSS aggregator. This is the first RSS aggregator that I’ve actually used past the initial playing-with-it phase. It’s okay, but it only lives on one machine. I regularly use at least three machines.
I could use one of the web-based aggregators, but I’m unwilling to put that much state into a 3rd party service I don’t control, especially since some of what I want to subscribe to are company-internal blogs.
So I’ve been toying with writing an RSS to IMAP aggregator. It would read from RSS feeds and write into an IMAP mailbox. IMAP already solves the authenticated-access-from-several-servers problem. With certain mail clients, it addresses offline-reading. It keeps track of which posts I’ve read.
Some searching suggested that a company offered this as a service at one time, but I found no implementations of it that I could download and use. Nor did that company appear to still be offering it as a service.
So with that, I set out to make a proof of concept:
    import imaplib
    import time
    from StringIO import StringIO
    from email.Generator import Generator
    from email.MIMEMultipart import MIMEMultipart
    from email.MIMEText import MIMEText
    from email.Parser import Parser

    import feedparser  # third-party: Universal Feed Parser

    def main():
        im = imaplib.IMAP4('**internal ip address**')
        im.login('rssfeeds', '**password**')
        im.select()
        im.expunge()
        # Fetch only the X-RSS-GUID header of every stored message.
        typ, messages = im.fetch('1:*', '(UID BODY[HEADER.FIELDS (X-RSS-GUID)])')
        # Remember which GUIDs are already in the mailbox.
        guids = {}
        if len(messages) > 0 and messages[0]:
            messages = [e for e in messages if isinstance(e, tuple)]
            for msg in messages:
                if len(msg) > 1:
                    p = Parser()
                    x = p.parsestr(msg[1], True)
                    if x['x-rss-guid']:
                        guids[x['x-rss-guid']] = True
        rss = feedparser.parse('http://www.xythian.com/~fox/test.xml')
        for i, entry in enumerate(rss.entries):
            message = MIMEMultipart()
            if entry.has_key('title'):
                message['subject'] = entry.title
            if entry.has_key('pubDate'):
                message['date'] = entry.pubDate
                entrydate = entry.pubDate
            else:
                entrydate = time.localtime()
            if entry.has_key('author'):
                message['From'] = entry.author
            else:
                message['From'] = 'rssfeed'
            # Fall back to the link when the entry carries no GUID.
            if entry.has_key('guid'):
                message['X-RSS-Guid'] = entry.guid
            else:
                message['X-RSS-Guid'] = entry.link
                entry.guid = entry.link
            if entry.link:
                message['link'] = entry.link
            if not guids.has_key(entry.guid):
                message['X-RSS-Source'] = 'rss url'
                # A link back to the original post, followed by the entry body.
                payload = MIMEText('<a href="%s">%s</a>\n' % (entry.link, entry.link) +
                                   entry.description, 'html', 'iso-8859-1')
                message.attach(payload)
                fp = StringIO()
                g = Generator(fp, mangle_from_=False, maxheaderlen=60)
                g.flatten(message)
                im.append('INBOX', (r'Unseen'), entrydate, fp.getvalue())
                fp.close()
                print 'saving: ', entry.guid
            else:
                print 'not adding dup', entry.guid

    if __name__ == '__main__':
        main()
There are clearly some hacky things going on in here, but it does prove the concept. It reads from an RSS feed and writes into an IMAP store. It does not yet flag the messages as unread, nor does it use any folders other than the INBOX. The next step is probably a nearly complete rewrite to separate the RSS aggregation from the IMAP store. I may decide to use a MySQL database to store meta-information like the list of feeds, or I may use an IMAP folder for that.
The real question, which will drive the next phase of development, is how this will be deployed and how I will interact with the aggregator itself. As I see it, the options include:
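To illustrate the separation described above, here is a minimal sketch in present-day Python 3 of the feed side on its own: turning an entry into a mail message and deciding which entries are new, without touching an IMAP connection. The function names (`entry_to_message`, `known_guids`, `new_messages`) and the plain-dict entry format are assumptions made for this illustration, not part of the prototype, and the header blocks are assumed to be already decoded to text.

```python
from email.mime.text import MIMEText
from email.parser import HeaderParser


def entry_guid(entry):
    """Use the entry's GUID, falling back to its link (as the prototype does)."""
    return entry.get('guid') or entry['link']


def entry_to_message(entry, source_url):
    """Build a mail message from one feed entry given as a plain dict."""
    link = entry['link']
    body = '<a href="%s">%s</a>\n%s' % (link, link, entry.get('description', ''))
    msg = MIMEText(body, 'html', 'utf-8')
    msg['Subject'] = entry.get('title', '(untitled)')
    msg['From'] = entry.get('author', 'rssfeed')
    msg['X-RSS-Guid'] = entry_guid(entry)
    msg['X-RSS-Source'] = source_url
    return msg


def known_guids(header_blocks):
    """Collect X-RSS-Guid values from header blocks already decoded to text."""
    parser = HeaderParser()
    guids = set()
    for block in header_blocks:
        guid = parser.parsestr(block)['X-RSS-Guid']
        if guid:
            guids.add(guid)
    return guids


def new_messages(entries, seen, source_url):
    """Yield messages only for entries whose GUID is not already stored."""
    for entry in entries:
        if entry_guid(entry) not in seen:
            yield entry_to_message(entry, source_url)
```

With this split, the IMAP side only needs to supply the stored header blocks and append whatever `new_messages` yields, so the store could be swapped out without changing the feed handling.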
- Deploy to my home Linux box: the obvious choice. I have control over it and can install whatever prerequisite libraries I want.
- Deploy to my Dreamhost account: this makes it independent of my home environment, but it is less clear how I would install what I want, and I may have to write it against Python 2.2.
- Deploy on my desktop Windows machine with a GUI: not an obvious choice, but one with a certain appeal. My desktop Windows machine is on anyway, the core engine will be separable for a Unix deploy, and it makes it clear how to distribute it. This option is mostly appealing from the point of view of making it available and accessible to other people.
For now I’m going to put what time I have for this into the core bits and put off deciding how it will be deployed. It is nearly certain that the first version will be deployed on my Linux machine.
There are still some questions about how it will treat RSS enclosures and image references. It could go ahead and download them and store them in the IMAP message (and rewrite internal links in the RSS body to refer to the attachments). This would be nice from the point of view of the IMAP clients not treating them as external resources (e.g. Thunderbird does not load external images in mail messages by default, a behavior I can’t change for a single mailbox and choose not to change for the whole application). It also makes offline reading possible for more feeds.
On the other hand, it is much more storage and bandwidth intensive to do that, which is fine if I use the Linux machine’s IMAP store but may be less fine if I decide to run the script on my machine but store the data in a Dreamhost mailbox.
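One way the download-and-rewrite idea could look is a `multipart/related` message whose HTML body refers to attached images through `cid:` URLs. The sketch below is in present-day Python 3 and is only an illustration: the `embed_images` name, the `fetched` mapping of already-downloaded images, and the `rss2imap.invalid` Content-ID domain are all assumptions, and the regular expression only handles double-quoted `src` attributes.

```python
import re
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Matches <img ... src="URL"> with a double-quoted src attribute only.
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def embed_images(html, fetched):
    """Return a multipart/related message with the fetched images inlined.

    `fetched` maps image URL -> (image bytes, subtype such as 'png').
    URLs missing from it are left pointing at the external resource.
    """
    cids = {}

    def rewrite(match):
        url = match.group(2)
        if url not in fetched:
            return match.group(0)
        # Hypothetical Content-ID scheme: one id per distinct image URL.
        cid = cids.setdefault(url, 'img%d@rss2imap.invalid' % len(cids))
        return '%scid:%s%s' % (match.group(1), cid, match.group(3))

    body = IMG_SRC.sub(rewrite, html)
    msg = MIMEMultipart('related')
    msg.attach(MIMEText(body, 'html', 'utf-8'))
    for url, cid in cids.items():
        data, subtype = fetched[url]
        part = MIMEImage(data, _subtype=subtype)
        part['Content-ID'] = '<%s>' % cid
        part['Content-Disposition'] = 'inline'
        msg.attach(part)
    return msg
```

Images that were not downloaded keep their original URLs, so the storage and bandwidth cost can be limited by choosing what goes into `fetched`.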
January 16th, 2005 at 2:56 pm
It’d be cool if it did MIME multipart/alternative to store the RSS entry as a separate part of the mail message. This would allow mail readers down the road to read the raw RSS and present it as they see fit. That would allow you to include enclosures in a way that mail clients could take advantage of in the future.
January 16th, 2005 at 4:12 pm
I thought about doing this in the full version (but not in the prototype). One argument against it (for me) is that I currently have no plans to point anything that understands RSS at it, and my RSS parser does not expose the source XML for a single entry. I do plan to handle RSS enclosures as MIME attachments, though.
If a mail client came along that could understand RSS item attachments to messages, I figure the processing code could be modified to attach it to the messages. In the meantime, it’d be clutter.
March 18th, 2007 at 8:20 pm
[…] I also made an RSS->SMTP sender that picked up the NEW_ARTICLE messages off the Spread message bus and mailed articles to my account. That worked, but was a little slow (since the messages had to go through Speakeasy) and a lot of them got snagged by Speakeasy’s spamassassin install. Rather than put my fetcher address on my Speakeasy whitelist, I resurrected the old RSS->IMAP code I wrote a while ago and refit it to run as a daemon listening to the Spread NEW_ARTICLE group. Now it drops new articles directly into the IMAP folder I set up months ago for this purpose. […]