When I copy the following paragraph from Word to Editor, it carries a lot of Word-specific code. We have a strict limit on the number of characters that can go into the database field, so I use the "Remove all Word specific markup" button under Clean Up HTML, and a lot of it goes away, except what is highlighted in red in the HTML below. About half of the characters in that paragraph are still Word code, so it could be a lot fewer characters if ALL Word code were removed.
Recurring code like that from Word should also be removed by the "Remove all Word specific markup" button. Yes, it is enclosed in angle brackets, so it might look like HTML tags, but the names inside the brackets are exact (st1:personname, st1:place, etc.), so you could easily recognize them as coming from Word and remove these strings too.
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Vivamus pulvinar. Vestibulum varius elit et dui. Ut sit amet nulla. Quisque diam est, feugiat sit amet, egestas nec, fringilla porta, dui. Nam turpis.
Cleaned with the "Remove all Word specific markup" button; the Word codes in red are still in the HTML:
<p> Lorem ipsum dolor sit a<st1:personname w:st="on">me</st1:personname>t, consectetuer adipiscing elit. Vivamus pulvinar. Vestibulum varius elit et dui. Ut sit a<st1:personname w:st="on">me</st1:personname>t nulla. Quisque diam est, feugiat sit a<st1:personname w:st="on">me</st1:personname>t, egestas nec, fringilla porta, dui. <st1:country-region w:st="on"><st1:place w:st="on">Nam</st1:place></st1:country-region> turpis.</p>
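To show what I mean by "easily recognize", here is a rough sketch of the kind of filter I have in mind. It is not the editor's actual cleanup code, just a small Python illustration that assumes the leftover Word tags always use the st1: prefix seen in the sample above. The names SMART_TAG and strip_smart_tags are made up for this example.

```python
import re

# Matches an opening, closing, or self-closing tag whose name starts with
# the "st1:" prefix, e.g. <st1:personname w:st="on"> or </st1:place>.
# Assumption: only the st1: prefix from the sample is handled here; other
# Word prefixes would have to be added to the pattern.
SMART_TAG = re.compile(r"</?st1:[A-Za-z][\w-]*(?:\s[^<>]*)?/?>", re.IGNORECASE)


def strip_smart_tags(html: str) -> str:
    """Remove st1:* tags but keep the text they wrap."""
    return SMART_TAG.sub("", html)


sample = (
    '<p> Lorem ipsum dolor sit a<st1:personname w:st="on">me</st1:personname>t, '
    '<st1:country-region w:st="on"><st1:place w:st="on">Nam</st1:place>'
    '</st1:country-region> turpis.</p>'
)

cleaned = strip_smart_tags(sample)
print(cleaned)
print(len(sample), "characters before,", len(cleaned), "after")
```

Only the tags themselves are dropped, so the wrapped text ("me", "Nam") stays in place and ordinary HTML such as the p tag is left alone.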