html2data offers a simple way to transform an HTML file or URL into structured data. For example:
>>> ## start the console
>>> from html2data import html2data
>>> html = """<!DOCTYPE html><html lang="en"><head></head>
<body>
<h1><b>Title</b></h1>
<div class="description">This is not a valid HTML
</body>
</html>"""
>>> config = {
    'map': [
        ['body_title', u'//h1/b/text()'],
        ['description', u'//div[@class="description"]/text()'],
    ]
}
>>> handler = html2data()
>>> received_obj = handler.load(html=html, config=config)
>>> print(received_obj)
{ 'body_title': 'Title', 'description': 'This is not a valid HTML'}
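To make the role of the 'map' entries more concrete, here is a minimal sketch of how such a key/XPath mapping could be evaluated with lxml directly. It is not html2data's actual implementation: the helper name apply_map is hypothetical, and joining the matched text nodes and trimming whitespace is an assumption about how results might be normalised.

```python
from lxml import html as lxml_html


def apply_map(html_text, config):
    """Hypothetical helper: evaluate each XPath in config['map'] on html_text.

    Assumes every expression selects text nodes (it ends in text()).
    """
    # lxml's HTML parser is lenient, so the unclosed <div> is tolerated.
    tree = lxml_html.fromstring(html_text)
    result = {}
    for key, xpath in config['map']:
        matches = tree.xpath(xpath)  # list of matched text nodes
        # Assumption: join all matches and trim surrounding whitespace.
        result[key] = ''.join(matches).strip()
    return result


HTML = """<!DOCTYPE html><html lang="en"><head></head>
<body>
<h1><b>Title</b></h1>
<div class="description">This is not a valid HTML
</body>
</html>"""

CONFIG = {
    'map': [
        ['body_title', '//h1/b/text()'],
        ['description', '//div[@class="description"]/text()'],
    ]
}

print(apply_map(HTML, CONFIG))
```

Each key in the result corresponds to the first element of a 'map' entry, and each value to whatever text the second element (the XPath expression) matches in the parsed document.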
Requirements:
· Python
· lxml
· httplib2