Cleaning Microsoft Word tags from HTML in C#

Last post 01-02-2014, 4:33 PM by frJericho. 16 replies.
  •  05-31-2013, 4:17 PM 77522

    Cleaning Microsoft Word tags from HTML in C#

    I am copying some sample content from Microsoft Word and pasting it into the HTML here: http://cutesoft.net/example/general.aspx

     

    When I paste, I get asked if I want to clean up the Microsoft tags.  I say yes and I get the following result:

     

    Which is very clean HTML without the Microsoft Word "mso" tags. So far, so good.

     

    I am trying to achieve the same result in C# code:

    [TestMethod]
    public void CleanUpMicrosoftWordHTML()
    {
        var source = "<font face=\"Times New Roman\" size=\"3\"></font>";
        source += "<p class=\"MsoNormal\" style=\"margin: 0in 0in 0pt;\"><span lang=\"NL\"><o:p><font face=\"Times New Roman\" size=\"3\">&nbsp;</font></o:p></span></p>";
        source += "<font face=\"Times New Roman\" size=\"3\"></font>";
        source += "<pre style=\"text-indent: -0.25in; margin-left: 0.5in; mso-list: l0 level1 lfo1; tab-stops: list .5in left 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt;\">";
        source += "<!--[if !supportLists]--><span class=\"migratedcontentfont1\"><span lang=\"NL\" style=\"font-family: Symbol; font-size: 12pt; mso-fareast-font-family: Symbol; mso-bidi-font-family: Symbol;\">";
        source += "<span style=\"mso-list: Ignore;\">&#183;<span style='font: 7pt/normal \"Times New Roman\"; font-size-adjust: none; font-stretch: normal;'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </span></span></span>";
        source += "</span><!--[endif]-->";
        source += "<span class=\"migratedcontentfont1\"><span lang=\"NL\" style='font-family: \"Times New Roman\",\"serif\"; font-size: 12pt;'>test<o:p></o:p></span></span></pre>";

        var expected = "<pre><!--[if !supportLists]--><span style=\"font-family: Symbol; font-size: 12pt;\"><span>&#183;<span> </span></span></span><!--[endif]--><span style='font-family: \"Times New Roman\",\"serif\"; font-size: 12pt;'>test</span></pre>";

        var result = EditorUtility.CleanUpMicrosoftWordHTML(source);

        Assert.AreEqual(expected, result);
    }
    Unfortunately the result is not at all what was expected. The result contains: "&nbsp;&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; test".

     

    So my question is: how can I properly clean up HTML in my C# code?

     

    By the way, I am running CuteEditor version 6.6 Build 2013-04-22.
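
For readers with the same question, here is a minimal sketch of a regex-based cleanup pass in C#. It is not CuteEditor's client-side logic and does not try to reproduce the exact `expected` string from the test above. `WordHtmlCleaner` and `Clean` are hypothetical names introduced for illustration. It only strips namespaced Office tags such as `<o:p>`, `class="Mso..."` attributes, and `mso-*` style declarations. Regex-based HTML processing is fragile (for example, it could also alter visible text that happens to look like an `mso-` declaration), so an HTML parser would likely be more robust.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of CuteEditor: a rough approximation of
// "clean up Word markup" using regular expressions only.
public static class WordHtmlCleaner
{
    public static string Clean(string html)
    {
        // Remove namespaced Office tags such as <o:p> and </o:p>, keeping their content.
        html = Regex.Replace(html, @"</?[a-z]+:[^>]*>", "", RegexOptions.IgnoreCase);

        // Drop class attributes whose value starts with "Mso" (e.g. class="MsoNormal").
        html = Regex.Replace(html, @"\s+class\s*=\s*(""Mso[^""]*""|'Mso[^']*')", "", RegexOptions.IgnoreCase);

        // Drop simple mso-* declarations (values without quotes) inside style attributes.
        html = Regex.Replace(html, @"\s*mso-[a-z-]+\s*:\s*[^;""']+;?", "", RegexOptions.IgnoreCase);

        // Remove style attributes that were left empty by the previous step.
        html = Regex.Replace(html, @"\s+style\s*=\s*(""\s*""|'\s*')", "", RegexOptions.IgnoreCase);

        return html;
    }
}
```

Under these assumptions, an input such as `<span style="mso-list: Ignore;">x</span>` should come out as `<span>x</span>`. `<font>` tags, `lang` attributes, and conditional comments are left untouched.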

  •  06-19-2013, 12:25 PM 77588 in reply to 77522

    Re: Cleaning Microsoft Word tags from HTML in C#

    Hi,

     

    We will do more research on it and reply to you soon.

     

    Regards,
    Terry

     

  •  07-01-2013, 8:19 PM 77638 in reply to 77588

    Re: Cleaning Microsoft Word tags from HTML in C#

    Any update?
  •  07-15-2013, 10:40 PM 77729 in reply to 77638

    Re: Cleaning Microsoft Word tags from HTML in C#

    Any update?
  •  07-18-2013, 12:30 PM 77737 in reply to 77729

    Re: Cleaning Microsoft Word tags from HTML in C#

  •  07-29-2013, 9:44 PM 77789 in reply to 77737

    Re: Cleaning Microsoft Word tags from HTML in C#

    Any update?
  •  07-31-2013, 12:59 AM 77806 in reply to 77789

    Re: Cleaning Microsoft Word tags from HTML in C#

    frJericho,

     

    I am very sorry for the long delay in responding to your request.

     

    We researched this issue. The problem is that the server-side logic for removing Word markup is not the same as the client-side logic.

     

    Our plan is to port the JavaScript logic to C# for both CuteEditor and RichTextEditor.

     

    Recently we fixed many CuteEditor 6.6 bugs, and we will release a new update for CuteEditor 6.6 soon.

     

    Regards,
    Terry 

  •  08-07-2013, 6:06 AM 77822 in reply to 77789

    Re: Cleaning Microsoft Word tags from HTML in C#

  •  09-02-2013, 4:16 PM 77948 in reply to 77822

    Re: Cleaning Microsoft Word tags from HTML in C#

    The result is different when using version 6.7 but still doesn't produce the expected result.

     

    Here's an example where I am using an HTML string containing Microsoft Office tags:

    and here's the result after cleaning it up with CuteEditor 6.7 (notice the HTML tag is missing, the entire HEAD section is missing and even the BODY tag is missing):

     

    To be clear, I am simply trying to clean my HTML with the following line of C# code:

    var cleanHtml = EditorUtility.CleanUpMicrosoftWordHTML(originalHtml); 

     

    I will email the full HTML I use for my testing to [email protected]
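
One possible workaround for the missing HTML, HEAD and BODY tags, sketched here as an assumption rather than a confirmed fix, is to pass only the inner HTML of the `<body>` element to the cleanup routine and splice the result back into the original document. `BodyOnlyCleaner` and `CleanBodyOnly` are hypothetical names. The cleanup routine is passed in as a delegate, so the sketch does not depend on CuteEditor itself.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of CuteEditor.
public static class BodyOnlyCleaner
{
    // Applies `clean` to the inner HTML of <body> only and leaves everything
    // outside it (doctype, <html>, <head>, the <body> tags themselves) untouched.
    public static string CleanBodyOnly(string html, Func<string, string> clean)
    {
        Match match = Regex.Match(
            html,
            @"(<body\b[^>]*>)(.*?)(</body\s*>)",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // No <body> element found: treat the whole input as a fragment.
        if (!match.Success)
        {
            return clean(html);
        }

        // Group 2 is the content between the opening and closing body tags.
        Group inner = match.Groups[2];
        return html.Substring(0, inner.Index)
             + clean(inner.Value)
             + html.Substring(inner.Index + inner.Length);
    }
}
```

Assuming `EditorUtility.CleanUpMicrosoftWordHTML` takes and returns a string, a call could look like `BodyOnlyCleaner.CleanBodyOnly(originalHtml, html => EditorUtility.CleanUpMicrosoftWordHTML(html))`. Any Word-specific markup inside the HEAD section (style blocks, XML islands) is left as-is by this approach.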

  •  09-02-2013, 4:57 PM 77949 in reply to 77948

    Re: Cleaning Microsoft Word tags from HTML in C#

    Hi frJericho,

     

    I can reproduce this issue too and will report it to the development team to check again. Sorry for the inconvenience.

     

    Regards,

     

    Ken 

  •  09-16-2013, 8:20 PM 77989 in reply to 77949

    Re: Cleaning Microsoft Word tags from HTML in C#

    Any update?
  •  09-30-2013, 10:28 PM 78037 in reply to 77989

    Re: Cleaning Microsoft Word tags from HTML in C#

    Any update?
  •  10-07-2013, 2:16 PM 78066 in reply to 78037

    Re: Cleaning Microsoft Word tags from HTML in C#

    Any update?
  •  11-04-2013, 2:23 PM 78300 in reply to 78066

    Re: Cleaning Microsoft Word tags from HTML in C#

     Any update?
  •  11-20-2013, 6:47 PM 78400 in reply to 78300

    Re: Cleaning Microsoft Word tags from HTML in C#

     Kenneth,

     

    Can you give me an update on this situation?

     
  •  12-26-2013, 2:56 AM 78664 in reply to 77522

    Re: Cleaning Microsoft Word tags from HTML in C#

     Hi there

    It is difficult for me to do that in code. I usually process Word documents with a word processing tool, which can process Word files or convert them to HTML. As for cleaning up HTML in C#, I have never tried it.

    But you could try adding a Word tool to help you. They offer detailed tutorials and code for new users.

    Hope this helps.

  •  01-02-2014, 4:33 PM 78681 in reply to 78664

    Re: Cleaning Microsoft Word tags from HTML in C#

    Kenneth? Adam?

     

    Any update?
