Fetch, parse, index: a professional tool set in Python

Published: 29 September 2009 in Python

This is my proposal for PyCon 2010 in Atlanta:

Studio Quattro Informatica is a small firm based in Italy specializing in textual semantic analysis and aggregators.
I will introduce some of the tools and techniques we currently use in our main projects.

And here’s the proposed outline:

  • Motivation: the Devil is in the details – The core of our projects is a semantic engine written entirely in Python. In my presentation I will deliberately ignore other important tools, such as neural networks and natural language toolkits, to focus on the packages that helped us handle many crucial details. Thanks to these packages we were able to work faster and happier and, for once, to kill the Devil.
  • Process: fetch, parse, index – The engine is built around a stream of textual content that flows from the Internet to our database. The fetcher collects web pages and RSS feed entries; from there, a group of parsers tries to extract useful information and hands it to the indexing system.
  • Fetch: be nice, use cache – I’m not an expert in caching HTTP requests, but I don’t have to be. httplib2 is a wonderful library that does everything for you: “Project goal – To become a worthy addition to the standard Python library”.
  • Parse: no addition required – The standard HTMLParser does a nice job of what we need in this phase: stripping tags, handling charrefs and entityrefs, and saving links to images or other pages.
  • Index: how do you recognize perfection? – “Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the Probabilistic Information Retrieval model and also supports a rich set of boolean query operators”.
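To make the fetch and parse steps of the outline a little more concrete, here is a minimal sketch. It is not our production code: it assumes Python 3, where the standard parser lives in html.parser, and the names fetch_page, parse_page, TextAndLinkExtractor and the ".cache" directory are made up for illustration. httplib2 is a third-party package, so it is imported only inside the function that needs it.

```python
from html.parser import HTMLParser


def fetch_page(url, cache_dir=".cache"):
    """Fetch a URL through httplib2 with an on-disk cache (hypothetical helper)."""
    import httplib2  # third-party; imported lazily so the parser works without it

    # Passing a directory name to Http() enables httplib2's file-based cache.
    http = httplib2.Http(cache_dir)
    response, content = http.request(url, "GET")
    # response.fromcache is True when the body was served from the local cache.
    return response, content


class TextAndLinkExtractor(HTMLParser):
    """Strip tags and collect text, page links and image URLs (illustrative)."""

    def __init__(self):
        # convert_charrefs=True decodes charrefs and entityrefs
        # (for example &amp; or &#233;) before handle_data is called.
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_data(self, data):
        # Keep only non-empty text chunks; the tags themselves are dropped.
        if data.strip():
            self.text_parts.append(data.strip())

    def text(self):
        return " ".join(self.text_parts)


def parse_page(html):
    """Return (plain text, links, image URLs) for one HTML string."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text(), parser.links, parser.images


sample = '<p>Caf&eacute; &amp; bar <a href="/menu">menu</a><img src="logo.png"></p>'
text, links, images = parse_page(sample)
print(text, links, images)
```

The text returned by parse_page is what would be handed to the indexing system. Note that this sketch does not filter out the contents of script or style elements, which a real parser stage would probably want to skip.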


