sql :: reformatting HTML to plain text

I received a request from a colleague asking me if I could reformat HTML in a cell to be readable so we could send it out to a customer (mssql db). Initially, I thought “easy”. How wrong I was.
As per normal, I scoured the net, as I’m sure someone else has done this, and came across , which is awesome, but its too generic. It strips everything between < > tags. We have embedded xml sometimes and I still need that to be displayed. Looks like I have to put all the HTML tags in that need to be stripped, individually…

Then I found which is a bastardised version of the original. Good, but still not good enough.

So off I went to customise.
Here’s what i came up with.

CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX) AS
BEGIN
       DECLARE @START  INT
       DECLARE @END    INT
       DECLARE @LENGTH INT
 
       -- Replace the HTML entity &amp; with the '&' character (this needs to be done first, as
       -- '&' might be double encoded as '&amp;amp;')
       SET @START = CHARINDEX('&amp;', @HTMLText)
       SET @END = @START + 4
       SET @LENGTH = (@END - @START) + 1
 
       WHILE (@START > 0 AND @END > 0 AND @LENGTH > 0) BEGIN
              SET @HTMLText = STUFF(@HTMLText, @START, @LENGTH, '&')
              SET @START = CHARINDEX('&amp;', @HTMLText)
              SET @END = @START + 4
              SET @LENGTH = (@END - @START) + 1
       END
 
       -- Replace the HTML entity &lt; with the '<' character
       SET @START = CHARINDEX('&lt;', @HTMLText)
       SET @END = @START + 3
       SET @LENGTH = (@END - @START) + 1
 
       WHILE (@START > 0 AND @END > 0 AND @LENGTH > 0) BEGIN
              SET @HTMLText = STUFF(@HTMLText, @START, @LENGTH, '<')
              SET @START = CHARINDEX('&lt;', @HTMLText)
              SET @END = @START + 3
              SET @LENGTH = (@END - @START) + 1
       END
 
       -- Replace the HTML entity &gt; with the '>' character
       SET @START = CHARINDEX('&gt;', @HTMLText)
       SET @END = @START + 3
       SET @LENGTH = (@END - @START) + 1
 
       WHILE (@START > 0 AND @END > 0 AND @LENGTH > 0) BEGIN
              SET @HTMLText = STUFF(@HTMLText, @START, @LENGTH, '>')
              SET @START = CHARINDEX('&gt;', @HTMLText)
              SET @END = @START + 3
              SET @LENGTH = (@END - @START) + 1
       END
 
       -- Replace the HTML entity &amp; with the '&' character
       SET @START = CHARINDEX('&amp;amp;', @HTMLText)
       SET @END = @START + 4
       SET @LENGTH = (@END - @START) + 1
 
       WHILE (@START > 0 AND @END > 0 AND @LENGTH > 0) BEGIN
              SET @HTMLText = STUFF(@HTMLText, @START, @LENGTH, '&')
              SET @START = CHARINDEX('&amp;amp;', @HTMLText)
              SET @END = @START + 4
              SET @LENGTH = (@END - @START) + 1
       END
 
       -- Replace the HTML entity &nbsp; with the ' ' character
       SET @START = CHARINDEX('&nbsp;', @HTMLText)
       SET @END = @START + 5
       SET @LENGTH = (@END - @START) + 1
 
       WHILE (@START > 0 AND @END > 0 AND @LENGTH > 0) BEGIN
              SET @HTMLText = STUFF(@HTMLText, @START, @LENGTH, ' ')
              SET @START = CHARINDEX('&nbsp;', @HTMLText)
              SET @END = @START + 5
              SET @LENGTH = (@END - @START) + 1
       END
 
       -- Replace any <br> tags with a newline
       SET @START = CHARINDEX('<br>', @HTMLText)
       SET @END = @START + 3
       SET @LENGTH = (@END - @START) + 1
 
       WHILE (@START > 0 AND @END > 0 AND @LENGTH > 0) BEGIN
              SET @HTMLText = STUFF(@HTMLText, @START, @LENGTH, CHAR(13) + CHAR(10))
              SET @START = CHARINDEX('<br>', @HTMLText)
              SET @END = @START + 3
              SET @LENGTH = (@END - @START) + 1
       END
 
       -- Replace any <br/> tags with a newline
       SET @START = CHARINDEX('<br/>', @HTMLText)
       SET @END = @START + 4
       SET @LENGTH = (@END - @START) + 1
 
       WHILE (@START > 0 AND @END > 0 AND @LENGTH > 0) BEGIN
              SET @HTMLText = STUFF(@HTMLText, @START, @LENGTH, CHAR(13) + CHAR(10))
              SET @START = CHARINDEX('<br/>', @HTMLText)
              SET @END = @START + 4
              SET @LENGTH = (@END - @START) + 1
       END
 
       -- Replace any <br /> tags with a newline
       SET @START = CHARINDEX('<br />', @HTMLText)
       SET @END = @START + 5
       SET @LENGTH = (@END - @START) + 1
 
       WHILE (@START > 0 AND @END > 0 AND @LENGTH > 0) BEGIN
              SET @HTMLText = STUFF(@HTMLText, @START, @LENGTH, CHAR(13) + CHAR(10))
              SET @START = CHARINDEX('<br />', @HTMLText)
              SET @END = @START + 5
              SET @LENGTH = (@END - @START) + 1
       END
 
       -- Replace any <p> tags with a newline
       SET @START = CHARINDEX('<p>', @HTMLText)
       SET @END = @START + 2
       SET @LENGTH = (@END - @START) + 1
 
       WHILE (@START > 0 AND @END > 0 AND @LENGTH > 0) BEGIN
              SET @HTMLText = STUFF(@HTMLText, @START, @LENGTH, CHAR(13) + CHAR(10))
              SET @START = CHARINDEX('<p>', @HTMLText)
              SET @END = @START + 2
              SET @LENGTH = (@END - @START) + 1
       END
 
       -- Replace any <p> tags with a newline
       SET @START = CHARINDEX('</p>', @HTMLText)
       SET @END = @START + 3
       SET @LENGTH = (@END - @START) + 1
 
       WHILE (@START > 0 AND @END > 0 AND @LENGTH > 0) BEGIN
              SET @HTMLText = STUFF(@HTMLText, @START, @LENGTH, CHAR(13) + CHAR(10))
              SET @START = CHARINDEX('</p>', @HTMLText)
              SET @END = @START + 3
              SET @LENGTH = (@END - @START) + 1
       END
 
       -- Remove any <p > tags 
    SET @START = CHARINDEX('<p ',@HTMLText)
    SET @END = CHARINDEX('>',@HTMLText,CHARINDEX('<p ',@HTMLText))
    SET @LENGTH = (@END - @START) + 1
    WHILE @START > 0 AND @END > 0 AND @LENGTH > 0
    BEGIN
        SET @HTMLText = STUFF(@HTMLText,@START,@LENGTH,'')
        SET @START = CHARINDEX('<p ',@HTMLText)
        SET @END = CHARINDEX('>',@HTMLText,CHARINDEX('<p ',@HTMLText))
        SET @LENGTH = (@END - @START) + 1
    END
 
       -- Remove any <span> tags 
    SET @START = CHARINDEX('<span',@HTMLText)
    SET @END = CHARINDEX('>',@HTMLText,CHARINDEX('<span',@HTMLText))
    SET @LENGTH = (@END - @START) + 1
    WHILE @START > 0 AND @END > 0 AND @LENGTH > 0
    BEGIN
        SET @HTMLText = STUFF(@HTMLText,@START,@LENGTH,'')
        SET @START = CHARINDEX('<span',@HTMLText)
        SET @END = CHARINDEX('>',@HTMLText,CHARINDEX('<span',@HTMLText))
        SET @LENGTH = (@END - @START) + 1
    END
 
    SET @START = CHARINDEX('</span',@HTMLText)
    SET @END = CHARINDEX('>',@HTMLText,CHARINDEX('</span',@HTMLText))
    SET @LENGTH = (@END - @START) + 1
    WHILE @START > 0 AND @END > 0 AND @LENGTH > 0
    BEGIN
        SET @HTMLText = STUFF(@HTMLText,@START,@LENGTH,'')
        SET @START = CHARINDEX('</span',@HTMLText)
        SET @END = CHARINDEX('>',@HTMLText,CHARINDEX('</span',@HTMLText))
        SET @LENGTH = (@END - @START) + 1
    END
 
    SET @START = CHARINDEX('<u>',@HTMLText)
    SET @END = @START+2
    SET @LENGTH = (@END - @START) + 1
    WHILE @START > 0 AND @END > 0 AND @LENGTH > 0
    BEGIN
        SET @HTMLText = STUFF(@HTMLText,@START,@LENGTH,'')
        SET @START = CHARINDEX('<u>',@HTMLText)
        SET @END = @START+2
        SET @LENGTH = (@END - @START) + 1
    END
 
    SET @START = CHARINDEX('</u>',@HTMLText)
    SET @END = @START+3
    SET @LENGTH = (@END - @START) + 1
    WHILE @START > 0 AND @END > 0 AND @LENGTH > 0
    BEGIN
        SET @HTMLText = STUFF(@HTMLText,@START,@LENGTH,'')
        SET @START = CHARINDEX('</u>',@HTMLText)
        SET @END = @START+3
        SET @LENGTH = (@END - @START) + 1
    END
 
    SET @START = CHARINDEX('<b>',@HTMLText)
    SET @END = @START+2
    SET @LENGTH = (@END - @START) + 1
    WHILE @START > 0 AND @END > 0 AND @LENGTH > 0
    BEGIN
        SET @HTMLText = STUFF(@HTMLText,@START,@LENGTH,'')
        SET @START = CHARINDEX('<b>',@HTMLText)
        SET @END = @START+2
        SET @LENGTH = (@END - @START) + 1
    END
 
    SET @START = CHARINDEX('</b>',@HTMLText)
    SET @END = @START+3
    SET @LENGTH = (@END - @START) + 1
    WHILE @START > 0 AND @END > 0 AND @LENGTH > 0
    BEGIN
        SET @HTMLText = STUFF(@HTMLText,@START,@LENGTH,'')
        SET @START = CHARINDEX('</b>',@HTMLText)
        SET @END = @START+3
        SET @LENGTH = (@END - @START) + 1
    END
 
    SET @START = CHARINDEX('</a>',@HTMLText)
    SET @END = @START+3
    SET @LENGTH = (@END - @START) + 1
    WHILE @START > 0 AND @END > 0 AND @LENGTH > 0
    BEGIN
        SET @HTMLText = STUFF(@HTMLText,@START,@LENGTH,'')
        SET @START = CHARINDEX('</a>',@HTMLText)
        SET @END = @START+3
        SET @LENGTH = (@END - @START) + 1
    END
 
    RETURN LTRIM(RTRIM(@HTMLText))
END

Use it like this

SELECT dbo.udf_StripHTML(AFS_Request.Requirements) FROM blah

and you’re set.

You can add your own set of tags by copying and modifying trim blocks, for the most part, it should make your HTML cell very readable.

enjoy!

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>