Exploring .NET Open Source ecosystem: manipulating HTML with HtmlAgilityPack | Andrei Marukovich

In my experience, the need to parse and manipulate HTML comes up surprisingly often. You may need to clean up an HTML file created by tools like Word or FrontPage (these tools are great for end users, but they inject lots of unnecessary information), parse a web page, or construct an HTML page programmatically.

In all these cases, HtmlAgilityPack may be a handy tool. It lets you load, parse, and modify “real-world” HTML, that is, HTML files that are not necessarily clean and well formed. Even better, for the parsed files it builds an XML-like DOM that supports XPath and LINQ.

It is easy to learn, and a simple example looks like this:

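using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(html); // html is a string holding the markup to parse

// Find the first element whose class attribute is exactly "icon" and read its text
var content = doc.DocumentNode.Descendants()
    .First(x => x.GetAttributeValue("class", "") == "icon")
    .InnerText;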

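This sample code returns the inner text of the first element whose class attribute is exactly “icon”. Note that First throws an InvalidOperationException when no element matches; use FirstOrDefault if the element may be missing.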
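
Since the library also supports XPath, the same lookup can be written as an XPath query, and the DOM can be modified in place, which is the kind of cleanup mentioned at the start of this post. The following is only a minimal sketch: the class name HtmlAgilityPackSketch, the helper methods, and the sample markup are hypothetical and were made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class HtmlAgilityPackSketch
{
    // Hypothetical sample markup: note the unclosed <p> and the inline styles.
    public const string SampleHtml =
        "<div><span class=\"icon\" style=\"color:red\">Home</span>" +
        "<p style=\"margin:0\">Hello</div>";

    // Returns the text of the first element with class="icon",
    // or null when no such element exists.
    public static string FirstIconText(HtmlDocument doc)
    {
        var node = doc.DocumentNode.SelectSingleNode("//*[@class='icon']");
        return node?.InnerText;
    }

    // Removes every inline style attribute and returns the resulting markup.
    public static string StripInlineStyles(HtmlDocument doc)
    {
        var styled = doc.DocumentNode.SelectNodes("//*[@style]");
        if (styled != null) // SelectNodes returns null when nothing matches
        {
            foreach (var node in styled)
            {
                node.Attributes.Remove("style");
            }
        }
        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(SampleHtml);

        Console.WriteLine(FirstIconText(doc));
        Console.WriteLine(StripInlineStyles(doc));
    }
}
```

Unlike First in the LINQ version, SelectSingleNode returns null when nothing matches, so the caller can decide how to handle a missing element instead of catching an exception.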

This is a simple but very useful library, so check it out at htmlagilitypack.codeplex.com.