• Non-ascii chars in unicode (or UTF-8 converter) (2003)


    #451287

    I have code that takes a Word document’s UNICODE content and puts it (in a raw reformatted html way) onto a web-page hosted by a Unix box that speaks UTF-8

    (For many reasons Word’s HTML is not the correct answer.)

    Anyway – the document contains unicode characters such as smart quotes, Macron characters or user defined bullets. My ideal solution is an automated conversion of UNICODE to UTF-8 that I can drive by VBA. My secondary position is to be able to write code that detects any character that is going to give me grief. This set will be small because we are really only dual language – and all the Māori Macron characters I already detect.

    For instance, my current method of handling ‘known’ special characters such as the em dash is to change them to their equivalent HTML entity (&mdash; for —).

    • #1110603

      How do you currently access the “UNICODE content”? Is this something other than Range.Text?

      Can you use HTML Tidy (someone created a COM wrapper for it)?

      • #1110615

        Yes, it comes from range.text, and I hadn’t thought of HTML_Tidy for this because the html isn’t the problem (but I’m on the case now if it can do code conversion)

        I guess my problem is still “How do I detect that Range.Text contains a non-ASCII character?” or how do I autoconvert it. Either works.

        Andrew

        • #1110741

          Here’s one way to fairly quickly find characters above the basic 255:

          Sub SubstituteNonAnsiChars()
              Dim p As Word.Paragraph, r As Word.Range, b() As Byte, _
                  lngCount As Long, lngPos As Long, strNew As String
              ' Loop through paragraphs in document
              For Each p In ActiveDocument.Paragraphs
                  Set r = p.Range
                  ' Do processing of formatting
                  ' === YOUR CODE HERE ===
                  ' Create a byte array of characters (two slots each: low byte, then high byte)
                  b = r.Text
                  ' Check for non-Ansi characters and replace with entities.
                  ' Walk backwards over the high bytes so earlier positions stay valid.
                  For lngCount = UBound(b) To 1 Step -2
                      If b(lngCount) <> 0 Then
                          ' Above 255, create entity with Unicode as hex
                          lngPos = (lngCount - 1) \ 2   ' zero-based character index
                          strNew = "&#" & "x" & Right("00" & Hex(b(lngCount)), 2) & _
                              Right("00" & Hex(b(lngCount - 1)), 2) & ";"
                          ' Replace original range content
                          r.Text = Left(r.Text, lngPos) & strNew & Mid(r.Text, lngPos + 2)
                      End If
                  Next
                  ' Clean up
                  Set r = Nothing
              Next
              ' Clean up
              If Not (p Is Nothing) Then Set p = Nothing
          End Sub

          I only tested it on a simple document, so if you find cases where it does not give the correct results, please post a sample document for testing.

          (Added: if you insert a stop after b=r.text, you can see why the for loop is set up the way it is.)
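
          For the narrower question of spotting anything non-ASCII (including 128 to 255, which the routine above leaves alone), a string-only variant might look like the sketch below. EncodeNonAscii is a made-up helper name, not something from this thread or from Word, and it assumes every character is in the Basic Multilingual Plane; a surrogate pair would come out as two separate references.

          ```vba
          ' Hypothetical helper (not part of Word): returns s with every character
          ' above 127 replaced by a decimal numeric character reference.
          ' Assumption: no surrogate pairs (characters beyond U+FFFF) in s.
          Function EncodeNonAscii(ByVal s As String) As String
              Dim i As Long, cp As Long, ch As String, out As String
              For i = 1 To Len(s)
                  ch = Mid$(s, i, 1)
                  ' AscW returns a signed Integer; mask it into the range 0..65535
                  cp = AscW(ch) And &HFFFF&
                  If cp > 127 Then
                      out = out & "&#" & CStr(cp) & ";"
                  Else
                      out = out & ch
                  End If
              Next i
              EncodeNonAscii = out
          End Function
          ```

          Called as EncodeNonAscii("M" & ChrW(257) & "ori") it returns M&#257;ori. Because it only reads the string, it can be run on Range.Text without touching the document's formatting.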

          • #1110759

            Cool – that’s my kind of code – I’ll give it a whirl thanks. (I anticipate correct results)

            I’m over most of my troubles now.
            The simple one that caught me was ¾ (3/4), expressed by Word as a single Arial character; in UTF-8 it seems to appear as an A umlaut followed by the ¾ as a single character.
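
            The 3/4 symptom fits how UTF-8 works: the character ¾ is U+00BE, which UTF-8 stores as the two bytes C2 BE, and a viewer that assumes Windows-1252 shows those as Â¾. Below is a rough sketch of the encoding rule for checking individual characters. Utf8Hex and HexByte are hypothetical helper names, and the sketch assumes Basic Multilingual Plane characters only (surrogate pairs are not combined).

            ```vba
            ' Hypothetical helper: formats one byte value as two hex digits plus a space.
            Private Function HexByte(ByVal b As Long) As String
                HexByte = Right$("0" & Hex$(b), 2) & " "
            End Function

            ' Hypothetical helper: returns the UTF-8 bytes of s as a hex string.
            ' Assumption: s contains no surrogate pairs.
            Function Utf8Hex(ByVal s As String) As String
                Dim i As Long, cp As Long, out As String
                For i = 1 To Len(s)
                    ' AscW returns a signed Integer; mask it into the range 0..65535
                    cp = AscW(Mid$(s, i, 1)) And &HFFFF&
                    Select Case cp
                        Case Is < &H80      ' one byte: plain ASCII
                            out = out & HexByte(cp)
                        Case Is < &H800     ' two bytes: 110xxxxx 10xxxxxx
                            out = out & HexByte(&HC0 Or (cp \ &H40)) _
                                      & HexByte(&H80 Or (cp And &H3F))
                        Case Else           ' three bytes: 1110xxxx 10xxxxxx 10xxxxxx
                            out = out & HexByte(&HE0 Or (cp \ &H1000)) _
                                      & HexByte(&H80 Or ((cp \ &H40) And &H3F)) _
                                      & HexByte(&H80 Or (cp And &H3F))
                    End Select
                Next i
                Utf8Hex = Trim$(out)
            End Function
            ```

            In the Immediate window, ? Utf8Hex(ChrW(&HBE)) gives C2 BE and ? Utf8Hex(ChrW(&H2014)) gives E2 80 94 (the em dash). So any character above 127 needs either real UTF-8 bytes or an entity, not only those above 255.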

    • #1110745

      I’m not so sure you need to change the Unicode characters–a computer set for UTF-8 encoding should be able to display most Unicode characters if they are in the font that is being used. But you may need to change ANSI characters (decimal 0128 to 0159) and any character formed by changing the font to, say, Symbol or Wingdings.

      If you don’t have access to a Unix computer, try changing your browser’s character encoding to UTF-8 and then viewing your file.

      PamC

      • #1110773

        Thanks for the response Pam – our service provider uses Unix and it was those rare few cases that caused the trouble – e.g. many of our users love those Wingdings bullets, and we publish documents created by others (e.g. researchers), so we have little control over content or style. Personally I think it’s all Word’s fault, and detecting the area the problem is in will work well for me.
        (The browser’s encoding had been set to UTF-8).

        • #1110936

          You are very welcome. And you are very right about Microsoft being the cause of much of this confusion.

          If you plan to use find and replace to change the ANSI characters, you should know that find and replace only works with decimal (not hex) numbers and that the leading zero is very important.
          * 3 digits up to 255–or 4 digits up to 0255 but excluding the range 0128 to 0159–gives ASCII and Extended ASCII characters.
          * 4 digits from 0128 to 0159 gives Windows ANSI characters. For example, Alt+151 gives ù, Alt+0151 gives —
          * 4 digits or more greater than 0255 gives Unicode characters.
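
          As a rough sketch of the find-and-replace route in VBA, assuming Word's ^0nnn search code (a decimal ANSI code written with its leading zero), the em dash could be swapped for its entity along these lines. The macro name is made up for illustration.

          ```vba
          ' Hypothetical macro name; replaces every em dash (ANSI 0151) with &mdash;
          Sub ReplaceEmDashWithEntity()
              With ActiveDocument.Content.Find
                  .ClearFormatting
                  .Replacement.ClearFormatting
                  .Text = "^0151"               ' decimal ANSI code, leading zero required
                  .Replacement.Text = "&mdash;"
                  .Forward = True
                  .Wrap = wdFindStop
                  .MatchWildcards = False
                  .Execute Replace:=wdReplaceAll
              End With
          End Sub
          ```

          The same pattern should work for the other codes in the 0128 to 0159 range, one search code per character; try it on a copy of the document first.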
