Topic: Non-ascii chars in unicode (or UTF-8 converter) (2003) @ AskWoody

- This topic has 7 replies, 3 voices, and was last updated 16 years, 10 months ago.

WSAndrewO
AskWoody Lounger

May 27, 2008 at 11:16 pm #451287

I have code that takes a Word document’s Unicode content and puts it (as raw, reformatted HTML) onto a web page hosted by a Unix box that speaks UTF-8.

(For many reasons Word’s HTML is not the correct answer)

Anyway – the document contains Unicode characters such as smart quotes, macron characters, or user-defined bullets. My ideal solution is an automated conversion of Word’s Unicode text to UTF-8 that I can drive from VBA. My fallback position is to write code that detects any character that is going to give me grief. This set will be small because we are really only dual-language, and I already detect all the Māori macron characters.

For instance, my current method of handling ‘known’ special characters such as the em dash is to change them to their equivalent HTML entity (for example, &mdash;).
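
One possible route to the automated conversion asked for above is the ADODB.Stream COM object, which can write a VBA string out as UTF-8. This is only a sketch: the procedure name `SaveTextAsUtf8` and the output path are hypothetical, it assumes ADODB is available on the machine (it normally is on Windows), and the stream usually prepends a UTF-8 byte-order mark that some Unix tools may not expect.

```vba
' Hypothetical helper: writes a VBA (UTF-16) string to a file as UTF-8.
Sub SaveTextAsUtf8(ByVal content As String, ByVal filePath As String)
    Const adTypeText As Long = 2             ' text (not binary) stream
    Const adSaveCreateOverWrite As Long = 2  ' overwrite an existing file
    Dim stm As Object
    Set stm = CreateObject("ADODB.Stream")   ' late binding, no reference needed
    With stm
        .Type = adTypeText
        .Charset = "utf-8"                   ' encoding used when the text is written
        .Open
        .WriteText content
        .SaveToFile filePath, adSaveCreateOverWrite
        .Close
    End With
End Sub
```

A call might look like `SaveTextAsUtf8 ActiveDocument.Range.Text, "C:\temp\out.txt"`, where the path is a placeholder.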


- WSjscher2000
  AskWoody Lounger
  
  May 28, 2008 at 1:46 am #1110603
  
  How do you currently access the “UNICODE content”? Is this something other than Range.Text?
  
  Can you use HTML Tidy (someone created a COM wrapper for it)?
  
- WSAndrewO
  AskWoody Lounger
  
  May 28, 2008 at 6:54 am #1110615
  
  Yes, it comes from range.text, and I hadn’t thought of HTML_Tidy for this because the html isn’t the problem (but I’m on the case now if it can do code conversion)
  
  I guess my problem is still “How do I detect that Range.Text contains a non-ASCII character?” or how do I autoconvert it. Either works.
  
  Andrew
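
  For the detection question, a minimal sketch is to walk the string and test each character's code with `AscW`. The function name `ListNonAscii` and its output format are assumptions made for illustration. Characters outside the Basic Multilingual Plane would show up as two surrogate code units.

  ```vba
  ' Hypothetical helper: lists "position:U+XXXX" for every character above 7-bit ASCII.
  Function ListNonAscii(ByVal s As String) As String
      Dim i As Long, code As Long, result As String
      For i = 1 To Len(s)
          ' AscW returns a signed Integer, so mask it to get 0..65535
          code = AscW(Mid$(s, i, 1)) And &HFFFF&
          If code > 127 Then
              result = result & i & ":U+" & Right$("0000" & Hex$(code), 4) & " "
          End If
      Next i
      ListNonAscii = Trim$(result)
  End Function
  ```

  For example, `ListNonAscii("M" & ChrW(&H101) & "ori " & ChrW(&H2014))` returns `2:U+0101 7:U+2014`; passing `ActiveDocument.Range.Text` would scan the whole document.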
  
  
  WSjscher2000
  AskWoody Lounger
  
  May 29, 2008 at 2:13 am #1110741
  
  Here’s one way to fairly quickly find characters above the basic 255:
  
      Sub SubstituteNonAnsiChars()
          Dim p As Word.Paragraph, r As Word.Range, b() As Byte, _
              lngCount As Long, intPos As Long, strNew As String
          ' Loop through paragraphs in document
          For Each p In ActiveDocument.Paragraphs
              Set r = p.Range
              ' Do processing of formatting
              ' === YOUR CODE HERE ===
              ' Create a byte array of characters (two slots each)
              b = r.Text
              ' Check for non-Ansi characters and replace with entities
              For lngCount = UBound(b) To 1 Step -2
                  If b(lngCount) > 0 Then
                      ' Above 255, create entity with Unicode as hex
                      intPos = (lngCount - 1) \ 2
                      strNew = "&#x" & Right("00" & Hex(b(lngCount)), 2) & _
                          Right("00" & Hex(b(lngCount - 1)), 2) & ";"
                      ' Replace original range content
                      r.Text = Left(r.Text, intPos) & strNew & Mid(r.Text, intPos + 2)
                  End If
              Next
              ' Clean up
              Set r = Nothing
          Next
          ' Clean up
          If Not (p Is Nothing) Then Set p = Nothing
      End Sub
  
  I only tested it on a simple document, so if you find cases where it does not give the correct results, please post a sample document for testing.
  
  (Added: if you insert a stop after b=r.text, you can see why the for loop is set up the way it is.)
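
  To illustrate the byte layout without stepping through a document, here is a small sketch (the procedure name `ShowByteLayout` is made up for this example). Assigning a string to a Byte array yields two bytes per character, low byte first, so the odd indexes hold the high bytes; a nonzero high byte means the character is above 255, which is why the loop starts at `UBound(b)` and steps by -2.

  ```vba
  ' Hypothetical demo: dump the bytes behind "A" followed by an em dash (U+2014).
  Sub ShowByteLayout()
      Dim b() As Byte, i As Long, dump As String
      b = "A" & ChrW(&H2014)          ' string-to-byte-array assignment, two bytes per character
      For i = LBound(b) To UBound(b)
          dump = dump & Right$("0" & Hex$(b(i)), 2) & " "
      Next i
      Debug.Print Trim$(dump)         ' prints: 41 00 14 20
  End Sub
  ```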
  
  
  WSAndrewO
  AskWoody Lounger
  
  May 29, 2008 at 11:49 am #1110759
  
  Cool – that’s my kind of code – I’ll give it a whirl thanks. (I anticipate correct results)
  
  I’m over most of my troubles now
  The simple one that caught me was 3/4 expressed by Word as a single Arial character – in UTF-8 it seems to appear as an A umlaut followed by the 3/4 as a single character.
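
  That symptom is what a two-byte UTF-8 sequence looks like when it is read as a single-byte code page: ¾ is U+00BE, UTF-8 encodes it as the bytes C2 BE, and a viewer assuming Windows-1252 shows C2 as an accented capital A followed by ¾. The sketch below computes the UTF-8 bytes for one code point; the names `Utf8Hex` and `Hex2` are hypothetical, and it only handles code points up to U+FFFF (no surrogate pairs).

  ```vba
  ' Hypothetical helper: two-digit hex for a value 0..255.
  Private Function Hex2(ByVal n As Long) As String
      Hex2 = Right$("0" & Hex$(n), 2)
  End Function

  ' Hypothetical helper: UTF-8 bytes (as hex text) for a code point 0..65535.
  Function Utf8Hex(ByVal codePoint As Long) As String
      Select Case codePoint
          Case Is < &H80      ' 1 byte: plain ASCII
              Utf8Hex = Hex2(codePoint)
          Case Is < &H800     ' 2 bytes: 110xxxxx 10xxxxxx
              Utf8Hex = Hex2(&HC0 Or (codePoint \ &H40)) & " " & _
                        Hex2(&H80 Or (codePoint And &H3F))
          Case Else           ' 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
              Utf8Hex = Hex2(&HE0 Or (codePoint \ &H1000)) & " " & _
                        Hex2(&H80 Or ((codePoint \ &H40) And &H3F)) & " " & _
                        Hex2(&H80 Or (codePoint And &H3F))
      End Select
  End Function
  ```

  `Utf8Hex(&HBE)` returns `C2 BE` and `Utf8Hex(&H2014)` returns `E2 80 94`.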
  
- WSPamCaswell
  AskWoody Lounger
  
  May 29, 2008 at 5:01 am #1110745
  
  I’m not so sure you need to change the Unicode characters–a computer set for UTF-8 encoding should be able to display most Unicode characters if they are in the font that is being used. But you may need to change ANSI characters (decimal 0128 to 0159) and any character formed by changing the font to, say, Symbol or Wingdings.
  
  If you don’t have access to a Unix computer, try changing your browser’s character encoding to UTF-8 and then viewing your file.
  
  PamC
  
- WSAndrewO
  AskWoody Lounger
  
  May 29, 2008 at 11:56 am #1110773
  
  Thanks for the response Pam – our service provider uses Unix and it was those rare few cases that caused the trouble – e.g. many of our users love those wingding bullets and we publish documents created by others (e.g. researchers) so have little control over content or style. Personally I think it’s all Word’s fault and detecting the area the problem is in will work well for me
  (The browser’s encoding had been set to UTF-8).
  
  
  WSPamCaswell
  AskWoody Lounger
  
  May 30, 2008 at 5:33 am #1110936
  
  You are very welcome. And you are very right about Microsoft being the cause of much of this confusion.
  
  If you plan to use find and replace to change the ANSI characters, you should know that find and replace only works with decimal (not hex) numbers and that the leading zero is very important.
  * 3 digits up to 255–or 4 digits up to 0255 but excluding the range 0128 to 0159–gives ASCII and Extended ASCII characters.
  * 4 digits from 0128 to 0159 gives Windows ANSI characters. For example, Alt+151 gives ù, Alt+0151 gives —.
  * 4 digits or more greater than 0255 gives Unicode characters.
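
  Driven from VBA, the decimal codes above might be used roughly as follows. This is a sketch only: the procedure name `ReplaceEmDashWithEntity` is made up, and it assumes Word's Find accepts `^0151` (the ANSI code with its leading zero) for the em dash; `^u8212` would be the Unicode-decimal form of the same search.

  ```vba
  ' Hypothetical example: replace every em dash in the document with an HTML entity.
  Sub ReplaceEmDashWithEntity()
      With ActiveDocument.Content.Find
          .ClearFormatting
          .Replacement.ClearFormatting
          .Text = "^0151"                 ' decimal ANSI code, leading zero required
          .Replacement.Text = "&mdash;"
          .Forward = True
          .Wrap = wdFindStop
          .MatchWildcards = False
          .Execute Replace:=wdReplaceAll
      End With
  End Sub
  ```

  Try it on a copy of the document first, since the replacement changes the text in place.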
  