Channel: Html Agility Pack
Viewing all 1060 articles

Commented Issue: SelectSingleNode searches from root when called from child [32539]

Even after having isolated to a child node via SelectNodes(), applying SelectSingleNode() to the child will search from the root above (presume from root, above anyway). This behaviour is unexpected, I expect it to search from the isolated child and below only.

Example:

<HTML>
<HEAD><TITLE> HTML Agility Bug Demo</TITLE></HEAD>
<BODY>
<somestuff>stuff here</somestuff>
<table>
<tr><td>first row</td></tr>
<tr><td>second row</td></tr>
<tr><td>third row</td></tr>
</table>
</BODY>
</HTML>

HtmlAgilityPack.HtmlDocument doc = new HtmlDocument();
doc.Load(@"HtmlAgilityBugDemo.html");
HtmlNodeCollection rowNodes = doc.DocumentNode.SelectNodes("//table/tr");
foreach (HtmlNode row in rowNodes)
{
string test1 = row.InnerText; // Works, enumerates correctly
string test2 = row.SelectSingleNode("//td").InnerText; // This ALWAYS returns "first row" !!
string test3 = row.SelectSingleNode("//somestuff").InnerText; // Found somestuff. But no stuff within this node !!
}
Comments: ** Comment from web user: HarryCallahan **

Yes very annoying as it's a common thing to do.

I've got around it, or rather contended with it, by loading the child's InnerHtml into a new doc and using that. A heavy weight solution.


New Post: Extracting Images And HTML From .html File


Hello,

I'm new to the Html Agility Pack and was wondering if someone could help me out.  I have a WPF C# project with an HTML string as shown below:

htmlString = "<HTML><HEAD></HEAD><BODY>Here are some images.</br>1) <IMG style='MARGIN-BOTTOM: 20px; MARGIN-LEFT: 20px' align=right src='images/sample001.jpg'>2) <IMG style='MARGIN-BOTTOM: 25px; MARGIN-LEFT: 25px' align=right src='images/sample002.png'></br> And some docs as well.</br>1) href='javascript:parent.POPUP({url:'testDoc001.htm',type:'shared',width:600,height:645})'></br>2) href='javascript:parent.POPUP({url:'testDoc002.html',type:'shared',width:700,height:712})'></br></BODY></HTML>";

I would like to parse this string and get an array of all the images and .html documents that are referenced.

In this particular example this array would be:


[0] = "images/sample001.jpg"
[1] = "images/sample002.png"
[2] = "testDoc001.htm"
[3] = "testDoc002.html"

Can someone send me a snippet of code or show me how to go about doing this?

Thanks

New Post: Inner Text but the hard way



I need to get some text from this web page (http://openbook.etoro.com/ahanit/#/profile/Trades/) because I want to use the trade feed in my program to analyse market sentiment.

I used the VB browser control and the get-element command, but it's not working. The problem is that whenever my browser starts to open the page I get JavaScript errors. Any help is welcome.

I tried with the DOM, but it seems that I don't quite understand what I need to do :) Here is what I have so far:

 

Dim code As String
Using client As New WebClient
    code = client.DownloadString("http://openbook.etoro.com/alemzolota/#/profile/Trades/")
End Using

Dim htmlDocument As IHTMLDocument2 = New HTMLDocument(code)
htmlDocument.write(htmlDocument)

Dim allElements As IHTMLElementCollection = htmlDocument.body.all
Dim allid As IHTMLElementCollection = allElements.tags("id")
Dim element As IHTMLElement

For Each element In allid
    element.title = element.innerText
    MsgBox(element.innerText)
Next

Thank you :)
 

New Post: How to check HtmlWeb.LoadAsync finished


I have a class in the ViewModel folder that uses HtmlWeb.LoadAsync to get data from the web:

 

public void GetContent(int index)
        {
            //get content
            HtmlWeb.LoadAsync(Magazines[index].Url, (s, args) =>
            {

              ....

             this.Magazines[index].ContentNode = contentNode.InnerHtml;
            });

}

 

Then I want to read Magazines[index].ContentNode in detailview.xaml like this:

protected override void OnNavigatedTo(NavigationEventArgs e)
{
    base.OnNavigatedTo(e);
    string selectedIndex = "";

    if (NavigationContext.QueryString.TryGetValue("selectedItem", out selectedIndex))
    {
        index = int.Parse(selectedIndex);
        App.MagazineViewModel.GetContent(index);
        String content = App.MagazineViewModel.Magazines[index].ContentNode;
        DetailBrser.NavigateToString(
            "<html><head><meta name='viewport' content='width=570, user-scalable=yes' /></head><body>"
            + HtmlHelper.EncodeUnicode(content)
            + "</body></html>"
            );
    }
}

But the problem is that the LoadAsync call has not finished yet, so App.MagazineViewModel.Magazines[index].ContentNode is still empty, which makes content empty too. How can I check in detailview.xaml that App.MagazineViewModel.GetContent(index) has finished before I set the content string? Any other ideas are welcome.
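
One possible approach, sketched under assumptions rather than taken from the thread: let GetContent accept a completion callback and do the navigation inside it, instead of reading ContentNode right after the call returns. FakeLoadAsync below is a hypothetical stand-in for a callback-based loader such as HtmlWeb.LoadAsync, and the onCompleted parameter is likewise an assumption that is not part of the original code.

```csharp
using System;
using System.Threading;

public static class CompletionCallbackSketch
{
    // Hypothetical stand-in for a callback-based loader such as HtmlWeb.LoadAsync:
    // it finishes on a worker thread and then invokes the callback with the result.
    public static void FakeLoadAsync(string url, Action<string> callback)
    {
        ThreadPool.QueueUserWorkItem(_ => callback("<p>content of " + url + "</p>"));
    }

    // GetContent takes an extra onCompleted parameter (an assumption, not in the
    // original post) and invokes it only after the content has been produced.
    public static void GetContent(string url, Action<string> onCompleted)
    {
        FakeLoadAsync(url, html =>
        {
            // The view model would store the result here, e.g. in ContentNode.
            onCompleted(html);
        });
    }
}
```

In the page, the NavigateToString call would then move into the callback passed to GetContent. If the callback arrives off the UI thread, it may need to be marshalled back with Dispatcher.BeginInvoke before touching the browser control.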

New Post: A bug when save to a stream

The methods
"public void Save(Stream outStream, Encoding encoding)"
and
"public void Save(Stream outStream)"
in class HtmlDocument create a StreamWriter with the default buffer size to write data to the stream,
but they never call Flush or Close at the end of the write, so any data still in the buffer is lost.
E.g.:
System.IO.MemoryStream ms = new MemoryStream();
htmldoc.Save(ms, System.Text.Encoding.UTF8);

Change the method "public void Save(StreamWriter writer)" in HtmlDocument as follows:
public void Save(StreamWriter writer)
        {
            Save((TextWriter)writer);
            writer.Flush();       //add Flush method to write buffer data to stream
        }
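
The buffering effect described above can be reproduced without Html Agility Pack. The following is a minimal sketch that uses only System.IO; the class and method names are made up for illustration. It writes a short string through a StreamWriter and reports how many bytes have reached the underlying MemoryStream with and without a Flush.

```csharp
using System.IO;
using System.Text;

public static class FlushSketch
{
    // Writes text through a StreamWriter but never flushes it.
    // Short text stays in the writer's internal buffer, so the stream is still empty.
    public static long BytesWithoutFlush(string text)
    {
        var ms = new MemoryStream();
        var writer = new StreamWriter(ms, new UTF8Encoding(false)); // no BOM
        writer.Write(text);
        return ms.Length;
    }

    // Same as above, but Flush pushes the buffered data into the stream.
    public static long BytesWithFlush(string text)
    {
        var ms = new MemoryStream();
        var writer = new StreamWriter(ms, new UTF8Encoding(false)); // no BOM
        writer.Write(text);
        writer.Flush();
        return ms.Length;
    }
}
```

For a short input such as "hello", BytesWithoutFlush returns 0 and BytesWithFlush returns 5, which matches the data loss reported when saving to a MemoryStream.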

New Post: Get value from single node


I am trying to get a single value from this node:

http://gg.pl/dysk/5TppYI-rUkJB5DppYI-rauA/depozyty%20z%C5%82otowe%20-%20WIBOR%206M-121502.png

from this site :

http://www.money.pl/pieniadze/depozyty/zlotowe/WIBOR6M,depozyty.html

 

        Dim text As String
        Dim strona As New HtmlAgilityPack.HtmlWeb()
        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc = strona.Load("http://www.money.pl/pieniadze/depozyty/zlotowe/WIBOR6M,depozyty.html")
        text = doc.DocumentNode.SelectSingleNode("//html/body/div[4]/div[2]/div[2]/div[2]/div[2]/div/table/tbody/tr/td[2]").InnerText

I get an error when I try to get the value:

System.NullReferenceException was unhandled

So, where is my error?

 

 

Created Issue: Cannot find type System.Xml.XPath.XPathNavigator in module System.Xml.dll [32616]

Happens as soon as I include the HtmlAgilityPack.dll reference and build the project.

New Post: Html encoding only text nodes


Hi,

Struggled with this for a while.

Can html agility pack currently encode only text nodes ?

I.E : "

hi <3 Did you know we're stocklists blah blah, can read our blog here 
<a href='http://google.com'>http://google.com</a> blah
<3 hi

"

would become

"

hi &lt;3 Did you know we&#39;re stocklists blah blah, can read our blog here 
<a href='http://google.com'>http://google.com</a> blah
&lt;3 hi

"

I searched the web and found no answer. Because of the processing I'm doing on the HTML text, I don't want HTML tags such as links to be encoded as well.

 

Thanks a ton,

Doru


Commented Issue: SelectSingleNode searches from root when called from child [32539]

Comments: ** Comment from web user: ThomasGat **

Html Agility Pack uses XPath to address nodes. See the XPath examples at http://msdn.microsoft.com/en-us/library/ms256086.aspx. Html Agility Pack is doing exactly what you’ve asked it to do. The current node ‘row’ is the current context, not the root of a new document. Thus “//td” asks for the first <td> element in the whole document, which is always “first row” in your example.

If you want to search the current node and its children, use “.//td” and “.//somestuff”.

Commented Issue: SelectSingleNode searches from root when called from child [32539]

Comments: ** Comment from web user: HarryCallahan **

I stand corrected, and surprised.

Testing with XMLDocument,

XmlDocument doc = new XmlDocument();
doc.LoadXml(xml);
var rowNodes = doc.SelectNodes("//table/tr");
foreach (XmlNode row in rowNodes)
{
string test1 = row.InnerText; // Works, enumerates correctly
string test2 = row.SelectSingleNode("//td").InnerText; // This ALWAYS returns "first row" !!
string test3 = row.SelectSingleNode("//somestuff").InnerText; // Found somestuff. But no stuff within this node !!
}

The result is the same, so as ThomasGat says it's expected XPath behaviour and is corrected by the simple use of "." to denote current node.
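
The difference between the two expressions can be checked with System.Xml alone. This is a small sketch built on the markup from the issue; the class and method names are invented for illustration. It runs the same per-row query once with "//td" and once with ".//td".

```csharp
using System.Xml;

public static class XPathContextSketch
{
    // Well-formed version of the table from the issue report.
    private const string Markup =
        "<html><body><somestuff>stuff here</somestuff><table>" +
        "<tr><td>first row</td></tr>" +
        "<tr><td>second row</td></tr>" +
        "<tr><td>third row</td></tr>" +
        "</table></body></html>";

    // Evaluates cellXPath against each <tr> and returns the text of the node found.
    public static string[] CellTexts(string cellXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Markup);
        XmlNodeList rows = doc.SelectNodes("//table/tr");
        var result = new string[rows.Count];
        for (int i = 0; i < rows.Count; i++)
        {
            // "//td" starts at the document root; ".//td" starts at the current row.
            result[i] = rows[i].SelectSingleNode(cellXPath).InnerText;
        }
        return result;
    }
}
```

CellTexts("//td") yields "first row" three times, while CellTexts(".//td") yields "first row", "second row" and "third row".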

New Post: Disable Proxy


Hi,

how can I totally disable the use of a proxy server?

New Post: How to use HAPPhone APIs to achieve this function


Hi,

I have run into a problem using XmlReader on WP7: it does not support BIG5 encoding when trying to read XML content.

Here is what I was trying to do.

using (XmlReader reader = XmlReader.Create("http://feeds.feedburner.com/nownews/politic"))
{
    while (reader.Read())  // iterate through the document
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Text:
                string s = reader.Value; // looking for content under all <item>
                break;
        }
    }
}

Could someone post some quick code so that I can check whether the characters display correctly on the device? Much appreciated!

thsieh 

New Post: Validate/fix a html content


Hi all,

Is it possible to validate/fix a html content with HtmlAgilityPack?

Thank you

New Post: .net 4.5 version and WinRT version


The Windows 8 RC version will be available soon (June). Developers can start writing apps now.

There are no libraries compatible with WinRT (that is, libraries that pass verification with WACK).

Html Agility Pack is one of the most interesting libraries for me as a developer.

How can I help with porting Html Agility Pack to .NET 4.5 and WinRT?

 

Sychev Igor

sychev-igor.90@mail.ru

skype: sychevigormsk

Created Issue: Html Agility Pack does not replace character references [32621]

This document:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><body>1&amp;2&#39;3&#x24;4</body></html>

shows in Internet Explorer as "1&2'3$4" (without the quotes)

However, loading it into Html Agility Pack's HtmlDocument, calling CreateNavigator on the document, and Evaluate("/html/body") on the navigator returns "1&amp;2&#39;3&#x24;4" (again, without the quotes)

This is incorrect, and inconsistent with the XML implementation of XPath that comes with .NET 4.0, which does replace "&amp;" with "&".
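
As a possible workaround while the navigator returns raw text, the string can be decoded after evaluation. The sketch below uses System.Net.WebUtility.HtmlDecode from the .NET base class library; the wrapper class is invented for illustration. Html Agility Pack also ships an HtmlEntity.DeEntitize helper that may serve the same purpose, but it is not exercised here.

```csharp
using System.Net;

public static class CharacterReferenceSketch
{
    // Replaces named, decimal and hexadecimal character references
    // (&amp;, &#39;, &#x24;) with the characters they stand for.
    public static string Decode(string raw)
    {
        return WebUtility.HtmlDecode(raw);
    }
}
```

Applied to the value from the report, Decode("1&amp;2&#39;3&#x24;4") gives "1&2'3$4", the same text Internet Explorer displays.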

New Post: how to get table from another website with method=post


I want to get a table from another website. For testing purposes I have made an HTML file and saved it on my desktop with the following code:

<html>
<head>
    <title>Search</title>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">   
</head>
<body >
    <center>

     <form  method="POST" action="targetsite">

   <input type="submit" value="submit" id="Button1"/>
   <input type="hidden" name="searchpl" value="7884" />
   <input type="HIDDEN" name="NRequest" value="parameterN"/>
        </form>
    </center>
</body>
</html>

When I click the submit button it works fine and redirects me to targetsite. But I don't want to be redirected to targetsite; instead I want to get the data from the second table on targetsite. I know HtmlAgilityPack can do it (I have tested it on another site and it works fine), but I don't know how to send POST data with HtmlAgilityPack using the above information. Can you help? I get an error when I use this code:

WebClient myWebClient = new WebClient();
        var doc = new HtmlDocument();

        NameValueCollection myNameValueCollection = new NameValueCollection{
        {"searchpl","BZA 7884"}
        ,{"NRequest","ChannelType=ct/Browser|RequestType=rt/Business|SubSystemType=st/Payments|AgencyType=at/PVO|ServiceName=PVO_VIO_BY_PL|PageID=PVO_Search|PVO_V_NUMBER=|P_ID=BZA7884|PVO_SEARCH_T=false|PVO_CO=TRUE|PVO_P_TYPE=PAS|PVO_S_NAME=NY"}
                };

        byte[] responseArray = myWebClient.UploadValues("TargetSite", "Post", myNameValueCollection);
        xRow = "/html[1]/body[1]/center[1]/table[1]";
        doc.LoadHtml(Encoding.ASCII.GetString(responseArray));
       divScrap.InnerHtml= doc.DocumentNode.SelectSingleNode(xRow).InnerText.Trim();

 

Error detail: An Error Has Occurred Your session has timed out or expired.

Commented Issue: HtmlAgilityPack v1.4.3 parses tables wrong [32107]

I have installed HtmlAgilityPack via NuGet and it installed version 1.4.3

This version has an error when handling tables!

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>hap test table</title>
</head>
<body>
<table>
<tr>
<td>foo</td>
<td>bar</td>
</tr>
</table>
</body>
</html>

becomes

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title>hap test table</title>
</head>
<body>
<table>
<tr>
<td>foo
<td>bar
</td></td></tr></table>
</body>
</html>

If I go back to version 1.4.0 then it works like it should...
Comments: ** Comment from web user: DarthObiwan **

So I'm finally able to get back working on HAP (been a few years of long hours and busy life). I am trying to repro this issue and so far with 1.3, 1.4, 1.4.3 I haven't been able to find any difference in InnerHtml nor WriteTo html output. Does anyone have a working example in code they could share that I could look at?

Commented Issue: HtmlAgilityPack v1.4.3 parses tables wrong [32107]

Comments: ** Comment from web user: DarthObiwan **

Note I'm using the information provided in this post, like the original html that is supposed to demonstrate it. I wrote a program to save the output to a text file, made 3 projects in my solution, each referencing a different version of the dll and then compared the results. I've tried WriteTo() and InnerHtml so far

New Comment on "Examples"

Hello, how would I extract the image URL (.jpg) from the following code?
<TABLE id=uezszu_24 class="uiGrid fbPhotosGrid" cellSpacing=0 cellPadding=0>
<TBODY>
<TR>
<TD class="vTop">
<DIV class=Wrapper><A class="uiMediaThumb uiScrollableThumb uiMediaThumbHuge" href="www.cccc.com/index.php" name=43563463 rel=theater aria-label="photo" ajaxify="dsgdgbdfgr45y6ghd"><I style="BACKGROUND-IMAGE: url(http://www.fressdgf.com/image.jpg)"></I></A></DIV></TD>
</TR>
</TBODY>
</TABLE>
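
A sketch of one way to do this, assuming the style attribute value has already been read from the <I> element (with Html Agility Pack, for instance, through GetAttributeValue("style", "") on the selected node). The class name and the regular expression are illustrative and only cover the simple url(...) form shown above.

```csharp
using System.Text.RegularExpressions;

public static class BackgroundImageSketch
{
    // Matches url(...) with optional quotes and captures the address inside.
    private static readonly Regex UrlPattern = new Regex(
        @"url\(\s*['""]?(?<url>[^'"")]+?)['""]?\s*\)",
        RegexOptions.IgnoreCase);

    // Returns the address inside url(...), or null when the style has none.
    public static string ExtractUrl(string style)
    {
        Match match = UrlPattern.Match(style ?? string.Empty);
        return match.Success ? match.Groups["url"].Value.Trim() : null;
    }
}
```

For the style value in the snippet, ExtractUrl returns "http://www.fressdgf.com/image.jpg".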

Commented Issue: incorrect parse

Please try to parse this and you will see the problem:

<form class="patrol" method="post" id="patrolForm">
<p class="timeleft">150</p>
</form>
Comments: ** Comment from web user: Siderite **

Duplicate of http://htmlagilitypack.codeplex.com/workitem/32505
