Channel: Html Agility Pack
Viewing all 1060 articles

Commented Issue: incorrect parse

Please try to parse this and you will see the problem:

<form class="patrol" method="post" id="patrolForm">
<p class="timeleft">150</p>
</form>
Comments: ** Comment from web user: Siderite **

The fix seems to be modifying HtmlNode.cs from
ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);
to
ElementsFlags.Add("form", HtmlElementFlag.CanOverlap);

Workaround without modifying the source:
HtmlNode.ElementsFlags["form"]=HtmlElementFlag.CanOverlap;
(This is a static change, so it applies to the entire application.)
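
A minimal sketch of the workaround applied to the markup from the report. The class name, method name and XPath expression are made up for illustration; only ElementsFlags and HtmlElementFlag.CanOverlap come from the comment above. Whether the <p> actually ends up nested under the <form> may depend on the library version, so the method returns null when it does not.

```csharp
using HtmlAgilityPack;

public static class FormFlagWorkaround
{
    // Applies the workaround, parses the markup, and returns the <p> found
    // directly under the <form>, or null if the parser did not nest it there.
    public static HtmlNode FindTimeLeft(string html)
    {
        // Static, process-wide setting: apply it once, before any parsing.
        HtmlNode.ElementsFlags["form"] = HtmlElementFlag.CanOverlap;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode(
            "//form[@id='patrolForm']/p[@class='timeleft']");
    }
}
```

Because the flag table is static, the change also affects every other HtmlDocument parsed later in the same process.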


Patch Uploaded: #12112

SychevIgor has uploaded a patch.

Description:
Solution files for .net 4.5 beta and vs11 beta.
I already use .NET 4.5 for prototyping some apps, so I need a .NET 4.5 version.

Just add the files from the archive to their folders.

Reviewed: 1.4.0 Stable (May 03, 2012)

Rated 5 Stars (out of 5) - Excellent library. I felt the need to correct for the two star review given for experiencing errors that were most likely the result of PEBKAC.

Commented Issue: Incorrect parsing of HTML4 optional end tags [29218]

Parsing valid HTML 4.01 that omits optional end tags results in an incorrect parse tree.

For example,
input: <ul><li>a<li>b</ul>
is parsed as: <ul><li>a<li>b</li></li></ul>
but should be: <ul><li>a</li><li>b</li></ul>

In practice, this means that HTML documents are quite commonly parsed incorrectly. The bug does not apply to *all* optional end tags; e.g. <p> tags are correctly auto-closed.

Example document:
<html><body>
<ul><li>TestElem1
<li>TestElem2
<li>TestElem3 List:
<ul><li>Nested1
<li>Nested2</li>
<li>Nested3
</ul>
<li>TestElem4
</ul>
<p>paragraph 1
<p>paragraph 2
<p>paragraph 3
</body></html>
Comments: ** Comment from web user: gmacgregor **

Hi Eamon,

I made one update to your patch for this fix. I have a page which needs the optional end tags closed but the tags in the page are all caps. So on line 1148 in the function FixNestedTags I changed:

string currentNodeName = CurrentNodeName();

To:

string currentNodeName = CurrentNodeName().ToLower();

Is this something you will incorporate into your patch and/or next release?

Thanks

Glenn

Commented Issue: SelectSingleNode searches from root when called from child [32539]

Even after isolating a child node via SelectNodes(), calling SelectSingleNode() on that child searches from the root (or at least from somewhere above the child). This behaviour is unexpected; I expect it to search only the isolated child and its descendants.

Example:

<HTML>
<HEAD><TITLE> HTML Agility Bug Demo</TITLE></HEAD>
<BODY>
<somestuff>stuff here</somestuff>
<table>
<tr><td>first row</td></tr>
<tr><td>second row</td></tr>
<tr><td>third row</td></tr>
</table>
</BODY>
</HTML>

HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
doc.Load(@"HtmlAgilityBugDemo.html");
HtmlNodeCollection rowNodes = doc.DocumentNode.SelectNodes("//table/tr");
foreach (HtmlNode row in rowNodes)
{
string test1 = row.InnerText; // Works, enumerates correctly
string test2 = row.SelectSingleNode("//td").InnerText; // This ALWAYS returns "first row" !!
string test3 = row.SelectSingleNode("//somestuff").InnerText; // Found somestuff. But no stuff within this node !!
}
Comments: ** Comment from web user: kimotun **

The issue comes from this function:
private HtmlDocument LoadUrl(Uri uri, string method, WebProxy proxy, NetworkCredential creds)
{
HtmlDocument doc = new HtmlDocument();
doc.OptionAutoCloseOnEnd = false;
doc.OptionFixNestedTags = true;
_statusCode = Get(uri, method, null, doc, proxy, creds);
if (_statusCode == HttpStatusCode.NotModified)
{
// read cached encoding
doc.DetectEncodingAndLoad(GetCachePath(uri));
}
return doc;
}
Just change that line to doc.OptionFixNestedTags = false;
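
Independently of the parser options, it may be worth ruling out the XPath expression itself. In XPath 1.0 a path that starts with // is evaluated from the document root whatever the context node is, while a path that starts with .// is evaluated from the context node. A minimal sketch using the table from the report (the class and method names are hypothetical):

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Returns the text of the first <td> inside each <tr> of the document.
    public static List<string> FirstCellPerRow(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var cells = new List<string>();
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//table/tr");
        if (rows == null)
            return cells; // SelectNodes returns null when nothing matches

        foreach (HtmlNode row in rows)
        {
            // ".//td" is relative to 'row'; "//td" would restart at the document root.
            HtmlNode cell = row.SelectSingleNode(".//td");
            if (cell != null)
                cells.Add(cell.InnerText);
        }
        return cells;
    }
}
```

With the relative form, each row yields its own cell instead of the first cell of the whole document.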

Commented Issue: SelectSingleNode searches from root when called from child [32539]

Comments: ** Comment from web user: kimotun **

It's not an XPath question. I will try to fix it later.

Created Issue: Problem with nuget package and Windows Phone [32678]

Hi,

If I have a Windows Phone 7.0 project and get HtmlAgilityPack from nuget, it works fine. But if the project is for Windows Phone 7.1, I get compile time errors about XPath. Forcing VS to use the WP7.0 version by replacing lib\sl4-windowsphone71 with the contents of lib\sl3-wp fixes the problem. Do you mind fixing the published nuget package?

Best Regards,
Gustavo Guerra

New Post: how to get attribute value plz

Hello,
I need to get the values of all the title attributes, but only those inside a <td class="val">.

Code:
<td class="val"><img class="r3" src="img/x.gif" alt="ind" title="ind" />
750 </td>

New Post: Possibility to be added as developer for creating test cases

Hi

Is it possible to be added to the project so I can create test cases for HtmlAgilityPack, which it definitely lacks?

New Post: Possibility to be added as developer for creating test cases

Yes, you can add the DLL file to your project.

New Post: how to get attribute value plz

Here's how I would do it:

HtmlWeb.LoadAsync(url, handleResults);

public void handleResults(object o, HtmlDocumentLoadCompleted args)
{
     IEnumerable<HtmlNode> columns = args.Document.DocumentNode.Descendants("td").Where(x => x.GetAttributeValue("class", "").Equals("val"));

}

If you're looking to get the title of the image inside each column, and assuming each column has exactly one image in it, you could add this:

IEnumerable<string> titles = columns.Select(x => x.Descendants("img").Single().GetAttributeValue("title", ""));
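
For reference, here is a self-contained variation on the same idea that parses a string instead of loading asynchronously, and that tolerates cells with zero or several images (Single() throws in those cases). The class and method names are invented for this sketch, and the class comparison is an exact match on the whole attribute value.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TitleExtractor
{
    // Collects the title attribute of every <img> inside a <td class="val">.
    public static List<string> TitlesInValCells(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants("td")
            .Where(td => td.GetAttributeValue("class", "") == "val")
            .SelectMany(td => td.Descendants("img"))          // zero, one or many images per cell
            .Select(img => img.GetAttributeValue("title", ""))
            .Where(title => title.Length > 0)                 // skip images without a title
            .ToList();
    }
}
```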

New Post: How to use HAPPhone APIs to achieve this function

Does the xml look like this: <text>Text you're trying to read</text>

Assuming the above is true, you can create a list of text strings like this:

HtmlWeb.LoadAsync("http://feeds.feedburner.com/nownews/politic", handleResults);

public void handleResults(object o, HtmlDocumentLoadCompleted args)
{
     IEnumerable<string> textList = args.Document.DocumentNode.Descendants("text").Select(x => x.InnerText);

}

New Post: How to check HtmlWeb.LoadAsync finished

Not sure if you thought of this, but why don't you just put all of the code that depends on the loaded content into the method that gets called when LoadAsync finishes? So everything after GetContent(index) in OnNavigatedTo() gets moved into LoadAsync() after

this.Magazines[index].ContentNode = contentNode.InnerHtml;

New Post: SelectNodes to return empty HtmlNodeCollection if no matching node is found

I thought SelectNodes would be used in a LINQ fashion... I was shocked when it returned null while I was trying to enumerate a few elements.

Come on guys, it's 2012 and this is a LINQ world now. It should return an empty collection rather than null.

New Post: SelectNodes to return empty HtmlNodeCollection if no matching node is found

Then use the LINQ interfaces of Descendants and Elements.

SelectNodes matches the .NET System.Xml API on which it was built.

I agree this should be overhauled to return an empty list, but the fact is that there are thousands of applications out there that could break if this were changed in a minor point release. New LINQ-like methods (Descendants, Elements and DescendantNodes) were added to more closely match the LINQ to XML interface.

This is one of the things I wanted to address in 2.0, if I can ever get the time to work on it. The current implementation with HtmlNodeCollection needs to be tossed out and replaced with a yield return and an IEnumerable<HtmlNode> return type.

For the people who want this, an extension method can solve it:

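using System.Collections.Generic;
using HtmlAgilityPack;

// Extension methods must live in a static class.
public static class HtmlNodeExtensions
{
	// SelectNodes returns null when nothing matches; substitute an empty list.
	public static IList<HtmlNode> SelectNodesAsList(this HtmlNode node, string xpath)
	{
		var list = node.SelectNodes(xpath);
		if (list == null)
			return new List<HtmlNode>();

		return list;
	}
}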

Commented Issue: HtmlAgilityPack v1.4.3 parses tables wrong [32107]

I have installed HtmlAgilityPack via NuGet and it installed version 1.4.3

This version has an error when handling tables!

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>hap test table</title>
</head>
<body>
<table>
<tr>
<td>foo</td>
<td>bar</td>
</tr>
</table>
</body>
</html>

becomes

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>hap test table</title>
</head>
<body>
<table>
<tr>
<td>foo
<td>bar
</td></td></tr></table>
</body>
</html>

If I go back to version 1.4.0 then it works like it should...
Comments: ** Comment from web user: tom103 **

It seems this problem is related to HtmlWeb; it occurs if you load the document with HtmlWeb.Load, but NOT if you load it with WebClient.OpenRead + HtmlDocument.Load...

Commented Issue: HtmlAgilityPack v1.4.3 parses tables wrong [32107]

Comments: ** Comment from web user: tom103 **

Further investigation in the code shows that HtmlWeb forces OptionFixNestedTags to true, which causes the issue. There should be a way to control these flags (PreHandleDocument occurs too late, so it's not a viable option).
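
Until such a setting exists, one way around it is the route mentioned in the earlier comment: fetch the bytes yourself and parse them with HtmlDocument.Load, so the option is set before parsing starts. A minimal sketch, assuming a hypothetical helper that takes any Stream (for example the one returned by WebClient.OpenRead):

```csharp
using System.IO;
using HtmlAgilityPack;

public static class ManualLoader
{
    // Parses the stream with OptionFixNestedTags turned off.
    public static HtmlDocument LoadWithoutFixNestedTags(Stream input)
    {
        var doc = new HtmlDocument();
        doc.OptionFixNestedTags = false; // must be set before Load to take effect
        doc.Load(input);
        return doc;
    }
}
```

The caller stays responsible for opening and disposing the stream, which keeps the helper independent of how the page is downloaded.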

Reviewed: 1.4.0 Stable (May 18, 2012)

Rated 5 Stars (out of 5) - It's rare to find such quality open source libraries for .NET. This one is a winner. And a keeper.

Commented Issue: Cannot find type System.Xml.XPath.XPathNavigator in module System.Xml.dll [32616]

Happens as soon as I include the HtmlAgilityPack.dll reference and build the project.
Comments: ** Comment from web user: aaronbrodersen **

This is an issue with building a Metro style app using the new WinRT API. Hopefully a change will be implemented soon.

In the meantime, here is a fork of this project that has found a solution to the issue:
https://bitbucket.org/antiufo/htmlagilitypack-for-metro

New Post: Removing element by class

I'm sure Dave is past his problem, but I was hitting this tonight and I eventually found a solution. Posting to help others:

Here's what worked for me to remove an element using its class:

HtmlAgilityPack.HtmlDocument htmldoc = new HtmlDocument();
htmldoc.LoadHtml(value.Element(aw + "content").Value);
var divs = htmldoc.DocumentNode.SelectNodes("//div");
if (divs != null)
{
    foreach (var tag in divs)
    {
        if (tag.Attributes["class"] != null)
        {
            if (string.Compare(tag.Attributes["class"].Value, "feedflare", StringComparison.InvariantCultureIgnoreCase) == 0)
            {
                tag.Remove();
            }
        }
    }
}

Description = (htmldoc.DocumentNode.OuterHtml);  //Gets the output as a string
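
The same removal can also be written with the LINQ-style Descendants method, which avoids the null check that SelectNodes needs. This is only a sketch: the class and method names are made up, and like the code above it matches the whole class attribute value, not one class among several.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ClassRemover
{
    // Removes every <div> whose class attribute equals className (ignoring case)
    // and returns the remaining markup as a string.
    public static string RemoveDivsWithClass(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToList() snapshots the matches so removing nodes does not disturb the enumeration.
        var matches = doc.DocumentNode
            .Descendants("div")
            .Where(div => string.Equals(
                div.GetAttributeValue("class", ""),
                className,
                StringComparison.OrdinalIgnoreCase))
            .ToList();

        foreach (HtmlNode div in matches)
            div.Remove();

        return doc.DocumentNode.OuterHtml;
    }
}
```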