In my experience, the need to parse and manipulate HTML comes up surprisingly often. You may need to clean up an HTML file created by tools like Word or FrontPage (these tools are great for end users, but they inject lots of unnecessary markup), parse a web page, or construct an HTML page programmatically.
In all these cases, HtmlAgilityPack can be a handy tool. It lets you load, parse and modify “real-world” HTML, that is, HTML files which are not necessarily clean and well formed. Even better, it builds an XML-like DOM for the parsed files, which supports XPath and LINQ.
It is easy to learn. A simple example looks like this:
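using System.Linq;
using HtmlAgilityPack;

// html is a string holding the markup to parse
var doc = new HtmlDocument();
doc.LoadHtml(html);
var docNode = doc.DocumentNode;
// Inner text of the first element whose class attribute is exactly "icon"
var content = docNode.Descendants()
    .First(x => x.GetAttributeValue("class", "")
        .Equals("icon")).InnerText;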
This sample code returns the inner text of the first element whose class attribute is exactly “icon”. Note that First throws an exception if no element matches.
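Since the parsed DOM also supports XPath, the same lookup can be written as an XPath query, and the document can be modified in place. The following is only a rough sketch, not the one right way to do it: the class name IconSketch, its two helper methods and the sample markup are made up for illustration, and it assumes the HtmlAgilityPack package is referenced. The XPath expression matches “icon” as one of several space-separated classes, which the exact comparison above does not.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper class, for illustration only.
public static class IconSketch
{
    // Returns the inner text of the first element that has "icon"
    // among its classes, or null when nothing matches.
    public static string FirstIconText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pad the class list with spaces so "icon" is matched as a whole token.
        var node = doc.DocumentNode.SelectSingleNode(
            "//*[contains(concat(' ', normalize-space(@class), ' '), ' icon ')]");

        // SelectSingleNode returns null instead of throwing when there is no match.
        return node?.InnerText;
    }

    // Removes every inline style attribute and returns the resulting markup,
    // as a small example of the clean-up use case.
    public static string StripStyles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var styled = doc.DocumentNode.SelectNodes("//*[@style]");
        if (styled != null)
        {
            foreach (var node in styled)
            {
                node.Attributes.Remove("style");
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Made-up sample markup.
        var html = "<div><span class=\"icon large\" style=\"color:red\">Hello</span><p>World</p></div>";

        Console.WriteLine(FirstIconText(html) ?? "(no match)");
        Console.WriteLine(StripStyles(html));
    }
}
```

Unlike First, this version does not throw when nothing matches, so the caller has to check for null instead.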
This is a simple but very useful library, so check it out at htmlagilitypack.codeplex.com