• IE Object Model (from Access/VBA)

    #393360

    Does anybody have a good reference that covers the IE Object Model, which I can use to scrape data from Web pages and put it into Access tables?

    • #712735

      Note to all: if you know of any good resources on the W3C standard Document Object Model (DOM), that probably would also help Pat (even if it doesn’t have all of the proprietary Microsoft extensions).

      • #714458

        Hi Jefferson
        Where did you learn about the IE Object Model?

        • #714520

          VBA object browser, MSDN, web searches… my HTML, VBScript and JavaScript books were modestly helpful. And I wouldn’t say that I know all that much.

          • #715251

            Hi everyone, another question along the same lines.

            Is there a way to get the URL behind a variable so I can then read that page, and so on?

            • #715259

              Can you explain the scenario a bit more: What kind of variable and where did you get it in the first place?

            • #715273

              I am in unfamiliar territory here.
              When I pull up a page and it contains a word or words highlighted in blue (I presume because there is a URL underneath; is this called a hyperlink?), I need to be able to get at this URL to go to the next page, and so on.

            • #715288

              Pat, when you use the document object model, your document has several collections that could be useful here. Assume you have created an object reference to the HTML document…

              Dim myHTMLDoc As MSHTML.HTMLDocument
              'Set myHTMLDoc = (something that returns an HTML document; not important for current purposes)

              … the one that seems most relevant (and most specific, which is important to avoid mistakenly targeting some garbage code) is the links collection:

              • myHTMLDoc.links.length gives you the count of all links in the entire page; remember that the collection is numbered starting from zero, so the index of the last item in the collection is length-1.
              • myHTMLDoc.links.item(0).innerHTML gives you the exact HTML code that is used to generate the visual display associated with the first link; it could be plain text, or text with HTML tags (such as an IMG tag), or just an image tag.
              • myHTMLDoc.links.item(0).innerText gives you the visible text, if any, that is associated with the first link; HTML tags are stripped out.
              • myHTMLDoc.links.item(0).href gives you the complete path for the first link.

              You could loop through the collection looking for a match to the expected “innerText” or use your imagination.
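
              The loop described above might look something like the following. This is only a minimal sketch: FindLinkHref and its strLinkText parameter are hypothetical names invented for illustration, and it assumes a reference to the Microsoft HTML Object Library and a document that has already finished loading.

              ```vb
              ' Hypothetical helper (not from the original post): returns the href of
              ' the first link whose visible text matches strLinkText, ignoring case,
              ' or an empty string if no link matches.
              Function FindLinkHref(myHTMLDoc As MSHTML.HTMLDocument, _
                                    strLinkText As String) As String
                  Dim i As Long
                  FindLinkHref = ""
                  ' The collection is zero-based, so the last index is Length - 1
                  For i = 0 To myHTMLDoc.links.Length - 1
                      If StrComp(Trim$(myHTMLDoc.links.Item(i).innerText), _
                                 strLinkText, vbTextCompare) = 0 Then
                          FindLinkHref = myHTMLDoc.links.Item(i).href
                          Exit Function
                      End If
                  Next i
              End Function
              ```

              A call such as FindLinkHref(myHTMLDoc, "Next page") would then give you a URL to pass to the browser object's navigate method; "Next page" is just a placeholder for whatever link text you expect.
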
            • #715292

              Thanks Jefferson, I’m sorry to be such a pest about this but I really need to find out about this.

              Now you are talking about a Document Object Model rather than the IE Object Model, for which you provided some code. That code works very well, thank you.

              Can the Document Object Model read in tables like the IE Object Model can?

              Where can I get some doco to read up on for the Document Object Model?

            • #715588

              The MSHTML library contains Microsoft’s encapsulation of the document object model (DOM). It is largely compliant with the W3C model, but has proprietary extensions such as the .all collection that you will see used frequently in code written for Internet Explorer version 4. In this sense, it both is and is not really the Internet Explorer object model. (grin) I hope that sort of clarifies the terminology.

              I guess strictly speaking the Internet Explorer object model is the one that contains the InternetExplorer object. I don’t remember the name that appears in the Tools>References dialog, but it could be similar to Microsoft Internet Controls.
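
              On the earlier question of whether the DOM can read tables: it exposes each table's rows and cells as collections, so a rough sketch along the following lines should be possible. DumpFirstTable is a hypothetical name made up for this example; it assumes a reference to the Microsoft HTML Object Library and an already loaded document, and it does nothing special for nested tables.

              ```vb
              ' Sketch only (hypothetical routine, not from the original thread):
              ' prints row index, cell index and cell text for the first TABLE
              ' in the document to the Immediate window.
              Sub DumpFirstTable(myHTMLDoc As MSHTML.HTMLDocument)
                  Dim colTables As Object
                  Dim tbl As MSHTML.HTMLTable
                  Dim r As Long, c As Long

                  Set colTables = myHTMLDoc.all.tags("TABLE")
                  If colTables.Length = 0 Then
                      Debug.Print "No tables found"
                      Exit Sub
                  End If

                  Set tbl = colTables.Item(0)
                  ' Rows and cells are zero-based collections, like links
                  For r = 0 To tbl.Rows.Length - 1
                      For c = 0 To tbl.Rows.Item(r).Cells.Length - 1
                          Debug.Print r, c, tbl.Rows.Item(r).Cells.Item(c).innerText
                      Next c
                  Next r
              End Sub
              ```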

            • #722887

              Other than the Links collection, is there a text collection or something that can reference all text on a page?

            • #724170

              Not sure what you mean by “text.” The <BODY> tag likely has an innerText property that would be the text of the entire <BODY>, stripped of HTML tags. Is that what you’re looking for?

            • #724228

              I want to get at the text of every word on the screen, so I suppose that the “BODY” would give me all the text.
              How would I go about getting all that text into a variable?
              You have been an excellent source for this problem.

            • #724448

              Here’s some sample code:

              Option Explicit
              'Declare Sleep API (on 64-bit Office, add PtrSafe after Declare)
              Private Declare Sub Sleep Lib "kernel32" (ByVal nMilliseconds As Long)

              Sub RetrieveBODYText()
                  ' Jefferson F. Scher 2003-10-04
                  ' Uses IE DOM to grab BODY text from web page
                  ' SET REFERENCES TO Microsoft HTML Object Library AND Microsoft Internet Controls
                  'Create browser object references
                  Dim ieSrc As New InternetExplorer

                  'Load page
                  With ieSrc
                      .Visible = True 'show window and load page
                      .navigate "http://www.microsoft.com/homepage/ms.htm"
                      While Not .readyState = READYSTATE_COMPLETE
                          Sleep 500 'wait 1/2 sec before trying again
                      Wend
                  End With

                  'Create document object model references
                  Dim ieDocSrc As MSHTML.HTMLDocument
                  Set ieDocSrc = ieSrc.Document

                  'Fetch the BODY Text
                  Dim strBODYtext As String, strBODYhtml As String, colBODYs As Variant
                  Set colBODYs = ieDocSrc.all.tags("BODY")
                  If colBODYs.Length = 0 Then
                      MsgBox "Page has no body (maybe it's a frameset?)"
                  Else ' get first body
                      strBODYtext = colBODYs(0).innerText
                      strBODYhtml = colBODYs(0).innerHTML
                      Stop 'inspect vars in the Locals and/or Immediate window
                  End If

                  'Clean up objects
                  Set ieDocSrc = Nothing
                  ieSrc.Quit
                  Set ieSrc = Nothing
              End Sub

            • #724631

              On first try, it seems just the ticket. I will get into it tonight.
              When you say get the first body (enclosed in (0)), does that mean there is more than one body to a page?

              Thank you.

            • #725258

              While there never should be more than one body, it seems safest to assume that there might be more than one. After all, this is HTML we’re talking about here, and not something that has any rules. (wink)

            • #725299

              Thanks Jefferson, I have used your code to get all the text and HTML on the page. It has solved my problem.

              Are you being cynical about HTML or what?

              Thanks again

            • #725301

              > Are you being cynical about HTML or what?

              I have wasted so much time over the years targeting web pages that kept changing… well… my only suggestion is to design your code so that when things do change, because they will, you don’t have to rearchitect everything.

            • #725305

              That’s good advice there.

              Thanks for your time and experience. I just needed the push in the right direction to generate the stuff I had to.

      • #714586

        If you’re looking for a website reference for W3C, W3.org might be a starting point.

        Have a Great day!!!
        Ken

        • #714670

          I don’t know if W3C is the ticket but I’ll certainly have a look at this site.

          What I want is doco on how to scrape details from a site (this could include tables and other text). What Jefferson provided was an example of how to get the data from a fixed-column table on a web page, which proved invaluable.
          I have modified this somewhat to get what I want, but I would like to be able to access other information on this page. So any other doco on this topic would be extremely valuable.
