• Iterating Word objects efficiently (Word VBA)


    • This topic has 16 replies, 7 voices, and was last updated 20 years ago.
    #416597

    Some recent threads have had to do with deleting objects from Word documents. As the posts show, when deleting objects, you’re usually confined to using a For…Next loop rather than the much faster For…Each loop.

    But in some cases, there’s a third alternative that I haven’t seen discussed before (though I admittedly didn’t look very hard) that offers nearly the same speed as a For…Each loop, but with the flexibility to delete objects along the way found in a For…Next loop.

    One such case is with Paragraphs, obviously something that you might often need to iterate and occasionally delete.

    Paragraphs, along with a handful of other objects (fields are another), include a “Next” property, which returns the next object in the series. By using the Next property, you can quickly move along a collection, while still being able to delete items along the way if needed.

    The following three examples don’t actually delete paragraphs (I thought I’d keep it simple for illustration purposes), but they do nicely illustrate the three different techniques for iterating Paragraphs in a Word document. I ran all three on the same 252-page, 4,030-paragraph Word document.

    1. The first uses a For…Next loop and was quite slow, as you might expect: 2 minutes, 20 seconds.
    2. The second uses a For…Each loop and was a bit speedier, at 2 seconds (yup!).
    3. The third, which starts at the first paragraph and uses the Next property to move along, also took 2 seconds.

    Sub IterateParasTheSlowWay()
    Dim doc As Document
    Dim para As Paragraph
    Dim k As Long   ' Long rather than Integer, so more than 32,767 paragraphs won't overflow
    Set doc = ActiveDocument
    
    For k = doc.Paragraphs.Count To 1 Step -1
        Set para = doc.Paragraphs(k)
        If para.Style = doc.Styles(wdStyleHeading1) Then
            para.Range.HighlightColorIndex = wdBrightGreen
        End If
    Next k
    
    End Sub
    '----------------------------------------------------------
    Sub IterateParasTheFastestWay()
    Dim doc As Document
    Dim para As Paragraph
    
    Set doc = ActiveDocument
    
    For Each para In doc.Paragraphs
        If para.Style = doc.Styles(wdStyleHeading1) Then
            para.Range.HighlightColorIndex = wdBrightGreen
        End If
    Next para
    
    End Sub
    '------------------------------------------------------------
    Sub IterateParasTheFastAndFlexibleWay()
    Dim doc As Document
    Dim para As Paragraph
    Dim paraNext As Paragraph
    Set doc = ActiveDocument
    
    Set para = doc.Paragraphs.First
    Do While Not para Is Nothing
        Set paraNext = para.Next
        If para.Style = doc.Styles(wdStyleHeading1) Then
            para.Range.HighlightColorIndex = wdBrightGreen
        End If
        Set para = paraNext
    Loop
    
    End Sub
    

    In the case of this last subroutine, instead of applying highlighting, I could just as easily have deleted those Heading 1 paragraphs, and still been able to move along the collection correctly, since I’ve already got my hands on the following paragraph, which becomes the current paragraph on the next trip through the loop.
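
    As a rough sketch of that variation (untested here, and the Sub name is made up for illustration), the only change from the third example is swapping the highlight for a delete. It assumes, as described above, that the reference captured in paraNext stays usable after the current paragraph is removed:

    ```vba
    Sub DeleteHeading1ParasWithNext()
        ' Hypothetical name; same loop as IterateParasTheFastAndFlexibleWay,
        ' but deleting instead of highlighting.
        Dim doc As Document
        Dim para As Paragraph
        Dim paraNext As Paragraph

        Set doc = ActiveDocument

        Set para = doc.Paragraphs.First
        Do While Not para Is Nothing
            ' Grab the following paragraph BEFORE touching the current one
            Set paraNext = para.Next
            If para.Style = doc.Styles(wdStyleHeading1) Then
                para.Range.Delete
            End If
            Set para = paraNext
        Loop
    End Sub
    ```

    Try it on a copy of your document first. Word will not remove the final paragraph mark of a document, so a Heading 1 in the very last paragraph ends up emptied rather than removed.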

    For…Each loops are still my weapon of choice when doing standard iterations, but the Next property technique (formally, the “linked-list method”) has proved a valuable addition to my Word macro toolbox.

    Cheers!

    • #932384

      Thanks for sharing! This will come in handy, I’m sure.

      For others reading this, Next and Previous are properties of the following objects:
      Cell
      Column
      Field
      FormField
      MailMergeField
      Pane
      Row
      TabStop
      TextFrame
      Window
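
      To show the same pattern on one of these objects, here is a minimal sketch using Row. It is untested, the Sub name is invented, and it assumes the first table in the document has no vertically merged cells (Word raises an error when you access individual rows in such tables). It deletes rows that contain no text:

      ```vba
      Sub DeleteEmptyRowsWithNext()
          ' Hypothetical example: removes text-free rows from the first table
          Dim tbl As Table
          Dim rw As Row
          Dim rwNext As Row
          Dim txt As String

          If ActiveDocument.Tables.Count = 0 Then Exit Sub
          Set tbl = ActiveDocument.Tables(1)

          Set rw = tbl.Rows.First
          Do While Not rw Is Nothing
              ' Capture the following row before any deletion
              Set rwNext = rw.Next
              ' Strip paragraph marks and cell/row markers; anything left is real text
              txt = Replace(Replace(rw.Range.Text, Chr(13), ""), Chr(7), "")
              If Len(txt) = 0 Then rw.Delete
              Set rw = rwNext
          Loop
      End Sub
      ```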

    • #932731

      Thanks Andrew. Neat trick and neat code samples too.

      Alan

    • #932753

      Andrew,

      That is neat, thanks for sharing it. There have been threads here before with regard to using .Next to iterate quickly, but that “Set obj = objNext” is a really nice trick.

      Another object that needs to be added to the list of objects that support First/Next is the Range object; in particular, this allows you to iterate through the Characters collection (you can also do it with For Each, but that is minus the benefit of your method). In this example, all characters that are upper-case get highlighted (don’t try this on a 200 page document!):

      Sub IterateCharactersNext()
      
         Dim doc As Document
         Dim char As Range
         Dim charNext As Range
         Set doc = ActiveDocument
         
         Set char = doc.Characters.First
         Do While Not char Is Nothing
            Set charNext = char.Next
            If char.Case = wdUpperCase Then
               char.HighlightColorIndex = wdBrightGreen
            End If
            Set char = charNext
         Loop
         
         Set doc = Nothing
         Set char = Nothing
         Set charNext = Nothing
         
      End Sub
      

      Gary

      • #932786

        Hi Gary,

        Thanks for the info on the Range object — that also means you can go by word as well as by character (and sentences, but Word’s definition of a sentence is a bit sketchy).

        Another recurring topic on the board is iterating over each character, just like you’ve described. The standard objection to doing so is that it’s slow (which is why you’ve warned against running your macro on a long document).

        But sometimes iterating each character is the best or only way to tackle a problem, so the question moves to how to optimize the iteration, so that you (well, not you specifically, Gary) only iterate characters when you absolutely have to.

        For example, if you wanted to work on any characters in a document whose formatting was different from that defined by its paragraph or character style (such as direct bold or italic applied), one fairly efficient approach is the following.

        This macro uses two supporting functions to isolate only those words in the document that contain some degree of direct formatting (for illustration purposes, I’ve confined ‘direct formatting’ to mean bold, italic, size or font name change — in practice, that’s usually sufficient).

        '=============================
        Sub IterateCharactersSelectively()
        Dim doc As Document
        Dim wrd As Range
        Dim char As Range
        Dim para As Paragraph
        
        Set doc = ActiveDocument
        For Each para In doc.Paragraphs
            If AnyDiffFontsInPara(para) = True Then
                For Each wrd In para.Range.Words
                    If AnyDiffFontsInWord(wrd) = True Then
                        wrd.Select
                        MsgBox "This word has character formatting " & _
                               "that is inconsistent with its style"
                        ' now you only have to iterate each character
                        ' in a word, rather than a whole paragraph
                        ' or a whole document. Put your character
                        ' iterating/modifying code here
                    End If
                Next wrd
            End If
        
        Next para
        End Sub
        

        Basically, there’s no point iterating all the characters in a particular paragraph if none of them are any different from the paragraph style properties. So by checking that first, we can move quickly past a lot of text. If and when we do find a paragraph that contains differing formatting, then we go word by word to isolate the problem, only then iterating each character. Depending on the amount of direct formatting in a document, and the average number of characters per word in your document, this technique can be several orders of magnitude faster than iterating each character in the document. Your mileage may vary.

        Here are the two supporting functions used by the main macro. These could be adjusted as needed to look for things like highlighting or superscripting.

        '===============================================
        Function AnyDiffFontsInPara(para As Paragraph) As Boolean
        Dim lDiffBold As Long
        Dim lDiffItal As Long
        Dim lDiffSize As Long
        Dim sDiffName As String
        
        AnyDiffFontsInPara = False
        
        With para.Range.Font
            lDiffBold = .Bold
            lDiffItal = .Italic
            lDiffSize = .Size
            sDiffName = .Name
        End With
        
        ' When a range has mixed formatting, Bold, Italic and Size come back as wdUndefined
        Select Case wdUndefined
            Case lDiffBold
                AnyDiffFontsInPara = True
                Exit Function
            Case lDiffItal
                AnyDiffFontsInPara = True
                Exit Function
            Case lDiffSize
                AnyDiffFontsInPara = True
                Exit Function
        End Select
        
        ' A mixed font name comes back as an empty string
        If Len(sDiffName) = 0 Then
            AnyDiffFontsInPara = True
            Exit Function
        End If
        End Function
        '==========================================
        Function AnyDiffFontsInWord(wrd As Range) As Boolean
        
        Dim docstyles As Styles
        Dim wrdstyle As String
        wrdstyle = wrd.Style
        Set docstyles = wrd.Parent.Styles
        
        Select Case True
            Case (Not wrd.Font.Bold = docstyles(wrdstyle).Font.Bold)
                AnyDiffFontsInWord = True
            Case (Not wrd.Font.Italic = docstyles(wrdstyle).Font.Italic)
                AnyDiffFontsInWord = True
            Case (Not wrd.Font.Name = docstyles(wrdstyle).Font.Name)
                AnyDiffFontsInWord = True
            Case (Not wrd.Font.Size = docstyles(wrdstyle).Font.Size)
                AnyDiffFontsInWord = True
        End Select
        End Function
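
        For the "Put your character iterating/modifying code here" spot in the main macro, one possible filler is sketched below. It is an illustration only: the Sub name and the choice of yellow highlighting are assumptions, it looks at bold and italic only, and like AnyDiffFontsInWord it assumes a single style applies to the whole word.

        ```vba
        Sub HighlightDirectBoldItalic(wrd As Range)
            ' Hypothetical helper: marks each character whose bold or italic
            ' setting differs from the style applied to the word.
            Dim char As Range
            Dim styleFont As Font
            Dim wrdstyle As String

            wrdstyle = wrd.Style
            Set styleFont = wrd.Document.Styles(wrdstyle).Font

            For Each char In wrd.Characters
                If char.Font.Bold <> styleFont.Bold _
                   Or char.Font.Italic <> styleFont.Italic Then
                    char.HighlightColorIndex = wdYellow
                End If
            Next char
        End Sub
        ```

        Because it only ever sees one word at a time, the per-character cost stays small even in a long document.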
        
        
        
        

        Cheers!

        • #932800

          Andrew,

          Thanks for posting this as well – this is great stuff, and deserves a star of its own. I may have missed some related threads in the past year or so, but recall a long one from 2001 or so on this same topic (will post a link later if I can track it down). If I recall right, Klaus Linke suggested a similar approach to optimizing by filtering what gets searched, but it’s safe to say that nothing posted back then approached this for elegance.

          Thanks also for demonstrating some unusual ways to use Select Case structures:

          Select Case wdUndefined
              Case lDiffBold
          
          Select Case True
              Case (Not wrd.Font.Bold = docstyles(wrdstyle).Font.Bold)

          Who knew?

          Gary

        • #933001

          Andrew, thanks for this. I found that it wasn’t isolating individual words within a range, and so I modified it slightly (attached) to collect a “fnt” object from the first character of the paragraph; I also made it a function that accepts a Range as parameter, so I’m not restricted to a document.

          PS I should add that I purchased a copy of WordHacks two weeks ago, and love it.

          • #933004

            Hi Chris,

            Glad to hear you like the book; it’s very gratifying to hear that people have found it useful.

            I’m a little unclear on what you mean by “wasn’t isolating individual words within a range”; could you describe the problem (or post a sample document)? I wasn’t able to get it to not isolate on each word. Your revised macro and my original produced the same results for me.

            • #933089

              > “wasn’t isolating individual words within a range”

              Andrew, I have attached a Sample.doc containing two paragraphs, each of which contains text formatted in a user-defined character style (MacroCharacters). The VBA module has a copy of your code. If I extend the formatting to include the second part of the word preceding the original formatting, it is detected correctly.

              Your code is timely as I am currently analysing 6,000+ documents with a client’s request to isolate all non-standard formatting, and had been using an abbreviated “font” object, much as you suggest, for matching:

              With .Font
                   strresult = strresult & .Bold & strdelim & .Italic & strdelim & _
                        .Underline & strdelim & .Size & strdelim & .StrikeThrough & strdelim
                  .Bold = wdUndefined

              >people have found it useful
              I wouldn’t have described it as “useful” (grin!)

            • #933112

              Ah, perhaps I wasn’t clear on the macro’s evaluation criteria. The goal is to isolate direct formatting not associated with a style — a paragraph style or a character style. In the case of your sample document, those words that use the “MacroCharacter” style are perfectly acceptable — the user has correctly applied a character style to differentiate a portion of a paragraph. If, however, you apply additional formatting on top of the MacroCharacter style, like italics, the text will get flagged.

              If you want to detect any deviation from the paragraph style (including the use of character styles), you could probably just change:

              Dim wrdstyle As String
              wrdstyle = wrd.Style
              

              to

              Dim parastyle As String
              parastyle = wrd.Paragraphs.First.Style
              

              I have not tested that, by the way.
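
              Spelled out as a whole function, that change might look something like the sketch below. It is equally untested, and the function name is invented so it can sit alongside the original AnyDiffFontsInWord:

              ```vba
              Function AnyDiffFromParaStyle(wrd As Range) As Boolean
                  ' Hypothetical variant of AnyDiffFontsInWord: compares the word's
                  ' font against its PARAGRAPH style, so text formatted with a
                  ' character style is flagged as well.
                  Dim paraFont As Font
                  Dim parastyle As String

                  parastyle = wrd.Paragraphs.First.Style
                  Set paraFont = wrd.Document.Styles(parastyle).Font

                  AnyDiffFromParaStyle = _
                      (wrd.Font.Bold <> paraFont.Bold) Or _
                      (wrd.Font.Italic <> paraFont.Italic) Or _
                      (wrd.Font.Name <> paraFont.Name) Or _
                      (wrd.Font.Size <> paraFont.Size)
              End Function
              ```

              A word with mixed formatting returns wdUndefined (or an empty font name), which never matches the style's value, so it gets flagged too.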

              Hope this makes sense. Cheers!

            • #933141

              > detect any deviation from the paragraph style
              Right, thanks, and yes, it does make sense.
              I realised this morning that an essential part of any detection like this will be establishing the basic criteria.
              My mind was on “different from the first character of the paragraph”, but it could have easily been “different from the first word of the paragraph” (including FWIW “undefined” as an allowable basis for comparison). Your example was, then “different from the style of the word”. I think I got that right. I’m pretty sure, though, that my problem was caused by my not reading your definition, and trying to make the code do what I wanted without first priming it in the correct manner.

    • #935066

      Hi Andrew,

      I just read the entire thread to date since I haven’t had a chance to read much of anything on the Lounge lately. This looked interesting.

      Let me hypothesize that the first loop is doing something different than the 2nd and 3rd. Practically speaking they result in the same output. I don’t know if this is really true, so this could be all smoke.

      The For…Next goes thru the paras in reverse order. Is it possible that Word or VBA has to step thru a link list each time to find the i-th para? Although I know why you go in reverse order, would the 1st approach be a little better than it is now if you went forward? Maybe not much.

      The 2nd approach lets Word/VBA do the driving and keeps track of pointers as you go thru the loop of paras. It probably takes advantage of the For…Each construct to step thru the collection of paras using pointer mechanisms. The 3rd approach seems like it could be the same with you doing the work instead of Word. In fact, I wouldn’t be surprised if the actual implementation of the 2nd approach looked like your 3rd approach.

      I would also suspect that there could be a difference in how the loop conditions were handled in the first 2 cases. For example, do you know if the looping statement

      For k = doc.Paragraphs.count To 1 Step -1

      has to retrieve the doc.Paragraphs.count in each iteration? Although this would be bad programming in terms of compiling the source code (or interpreting it), I’ve seen worse. If this makes a diff, then I always stored this kind of var in a local var. That is:

      paraCount = doc.Paragraphs.count
      For k = paraCount to 1 Step -1

      Also, another key diff between your 1st and 2nd approaches is the need for the Set stmt in the 1st approach. This probably adds overhead in that the code has to set a pointer after retrieving yet other info (doc.Paragraphs(k)). The 2nd approach is letting the loop mechanism take care of this so you’re cutting out figuring out what doc.Paragraphs(k) is.

      Even tho the 3rd approach does a Set, you’re setting paraNext, itself a “pointer”, to a “pointer” in the current para, which you already have access to. So, as I mentioned above, the 2nd and 3rd approaches should be the same.

      I’d also wonder if the size of the doc may have something to do with the big diff. For example, in getting to a 200+ page doc, I doubt that you entered the paras sequentially. Type a few paras, go back and insert a para before another para, copy and paste a few paras from another document in between 2 existing paras. I’m going to guess not and that Word probably relinks the para collection when you insert. So the link list would look the same when all’s said and done regardless of whether you entered them “right the first time” or went back and forth as mentioned just above.

      Or this could be way off base.

      Fred

      • #935185

        Hi Fred,

        > Let me hypothesize that the first loop is doing something different than the 2nd and 3rd. Practically speaking they result in the same output. I don’t know if this is really true, so this could be all smoke.

        For … Each loops are an optimized shortcut for iterating a collection of objects, or an array of Variants. It’s functionally (though not performance-wise) equivalent to use either of the following two loops:

        For Each para in ActiveDocument.Paragraphs
          ' Do something here
        Next para
         ' ------
        For i = 1 to ActiveDocument.Paragraphs.Count
          Set para = ActiveDocument.Paragraphs(i)
          ' do something here
        Next i
        

        In the case of the For Each loop, the set statement is implicit, and there’s no need for an iterator variable, since the 1 to .count is also implicit. The For Each loop is faster because VBA can, in effect, pre-load the objects your loop needs. The For..Next loop on the other hand, can’t do that, because VBA has no way of knowing whether or how much the value of i might change between iterations.
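
        If you want to check the difference on your own document, a rough timing harness along these lines should do. The Sub name is made up, it assumes the three Subs from the first post are in the same project, and it relies on VBA's Timer function (seconds since midnight, so avoid running it across midnight):

        ```vba
        Sub TimeTheThreeLoops()
            ' Hypothetical harness: prints elapsed seconds to the Immediate window
            Dim t As Single

            t = Timer
            IterateParasTheSlowWay
            Debug.Print "For...Next:    "; Format(Timer - t, "0.00"); " s"

            t = Timer
            IterateParasTheFastestWay
            Debug.Print "For Each:      "; Format(Timer - t, "0.00"); " s"

            t = Timer
            IterateParasTheFastAndFlexibleWay
            Debug.Print "Next property: "; Format(Timer - t, "0.00"); " s"
        End Sub
        ```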

        > The For…Next goes thru the paras in reverse order. Is it possible that Word or VBA has to step thru a link list each time to find the i-th para? Although I know why you go in reverse order, would the 1st approach be a little better than it is now if you went forward? Maybe not much.

        The order in which you iterate doesn’t matter for speed, but is important if you want to delete any items. Deleting items while moving forward will result in skipped items, which is the same thing that can happen if you try deleting while using a For..Each loop. Consider this example:
        A document with 4 paragraphs, in this order: Heading 1, Heading 2, Heading 2, Normal.

        For k = 1 to ActiveDocument.Paragraphs.Count
          If ActiveDocument.Paragraphs(k).Style = "Heading 2" Then ActiveDocument.Paragraphs(k).Range.Delete
        Next k
        

        In this case, the third paragraph won’t get deleted (and you’ll get an error when k gets to 4).
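
        For comparison, here is a sketch of the same loop run backwards (the Sub name is just for illustration). Deleting paragraph k only shifts the items after it, which the loop has already visited, so nothing is skipped and k never runs past the end:

        ```vba
        Sub DeleteHeading2Backwards()
            Dim k As Long
            With ActiveDocument
                ' Start at the last paragraph and work toward the first
                For k = .Paragraphs.Count To 1 Step -1
                    If .Paragraphs(k).Style = "Heading 2" Then
                        .Paragraphs(k).Range.Delete
                    End If
                Next k
            End With
        End Sub
        ```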

        > The 2nd approach lets Word/VBA do the driving and keeps track of pointers as you go thru the loop of paras. It probably takes advantage of the For…Each construct to step thru the collection of paras using pointer mechanisms. The 3rd approach seems like it could be the same with you doing the work instead of Word. In fact, I wouldn’t be surprised if the actual implementation of the 2nd approach looked like your 3rd approach.

        I think you’re probably pretty close on that.

        > I would also suspect that there could be a difference in how the loop conditions were handled in the first 2 cases. For example, do you know if the looping statement
        > For k = doc.Paragraphs.count To 1 Step -1
        > has to retrieve the doc.Paragraphs.count in each iteration?

        No, the value is computed once at the start of the loop, not during each iteration.
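
        A quick way to see this for yourself, sketched with made-up names (UpperBound, callCount), is to put the upper bound behind a function that counts its own calls. It needs nothing from Word, just a standard module:

        ```vba
        ' Module-level counter; must sit at the top of the module
        Private callCount As Long

        Private Function UpperBound() As Long
            ' Count every time the loop asks for its upper limit
            callCount = callCount + 1
            UpperBound = 5
        End Function

        Sub ShowBoundEvaluatedOnce()
            Dim k As Long
            callCount = 0
            For k = 1 To UpperBound()
                ' five trips through the loop
            Next k
            Debug.Print callCount   ' prints 1
        End Sub
        ```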

        > I’d also wonder if the size of the doc may have something to do with the big diff. For example, in getting to a 200+ page doc, I doubt that you entered the paras sequentially. Type a few paras, go back and insert a para before another para, copy and paste a few paras from another document in between 2 existing paras. I’m going to guess not and that Word probably relinks the para collection when you insert. So the link list would look the same when all’s said and done regardless of whether you entered them “right the first time” or went back and forth as mentioned just above.

        I actually did insert them sequentially, using the rand() trick. I wouldn’t think the order in which the paragraphs were entered would have much impact on the efficiency of the iteration, but I could be wrong.

        Thanks for the insightful comments!

        • #935324

          Based on my experience, it looks to me like if you use For Each…Next to iterate through a document’s paragraphs, deleting some of the paragraphs as you go, no paragraphs get skipped (assuming the code in the loop isn’t using some kind of index reference that hasn’t been adjusted to account for the deletions). To take your 4-paragraph document example, this works (without any skipping):

              For Each parX In docX.Paragraphs
                  If parX.Style = "Heading 2" Then
                      parX.Range.Delete
                  End If
              Next parX

          I’ve found the same to be true of iterating through a document’s styles, and I expect it’s true of most of the VBA object collections.

          • #935422

            Hi Steve,

            I’ve definitely run into problems when deleting while using a For Each loop, particularly with Hyperlinks and Fields. For example, take a look at the attached document. It’s got a macro that tries to delete all the hyperlinks in a document with a For..Each loop:

            Sub DeleteWithForEach()
            ' Demonstrates the problem: this does NOT remove every hyperlink
            Dim h As Hyperlink
            For Each h In ActiveDocument.Hyperlinks
               h.Delete
            Next h
            End Sub
            

            Just double click the macrobutton field in the first paragraph to run it.

            You’ll see it definitely does not delete all the hyperlinks.

            Rather than experiment with different collections to see which ones might work with a For Each, when deleting I always use a For Next and work backwards (or use the linked-list method described above).
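
            For the hyperlink case, the backwards For Next version is only a few lines. This is a sketch with an invented Sub name; note that Hyperlink.Delete removes the link itself and leaves the display text in the document:

            ```vba
            Sub DeleteHyperlinksBackwards()
                Dim i As Long
                With ActiveDocument.Hyperlinks
                    ' Work from the last hyperlink to the first so the
                    ' remaining index values stay valid after each delete
                    For i = .Count To 1 Step -1
                        .Item(i).Delete
                    Next i
                End With
            End Sub
            ```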

            • #935485

              I see what you mean about Hyperlinks, but it looks like that’s a consistent, replicable “design flaw” having to do with the way For Each works with Hyperlinks. What happens in both your sample document and in a separate document I created is that the odd-numbered Hyperlinks (i.e., every other Hyperlink) get deleted, suggesting that Word is effectively using some kind of non-updated index counter to work through the document’s Hyperlinks.

              In a sense, the consistency is encouraging (to me, anyway) because it suggests (to me, anyway) that built-in object collections that don’t display the every-other-item behavior will probably consistently not be subject to the every-other-item flaw. Those “non-flawed” collections seem to include Paragraphs, Styles, date Fields (the only kind I’ve tried) and Bookmarks.

              I’d be interested to hear if anyone else has encountered “skipped items” behavior using For Each with Paragraphs or Styles where (1) the loop deleted items, and (2) the loop didn’t refer to items using an index that wasn’t adjusted to account for the deletions. Given that For Each..Next is supposedly much more efficient than For..Next when dealing with collections, I’d hope it would turn out the list of “flawed” collections is a narrow list and we can confidently use For Each with the rest.
