Delphi Using Data:
Extracting Plain Text From HTML
Posted 14 years ago on 11/18/2006 and updated 1/13/2008


Summary: HTML is great. Sometimes, though, we need to get at just the embedded plain text.

Download Link:

 Download Now: 413_Attachment.zip

There are times when it's useful to extract the plain text from an HTML document. One example:

You're working with a database that supports full-text indexing, and you know you don't want the index cluttered with useless entries - like HTML tags.

One of the things I really like about software development is that there are usually many ways to accomplish a given task. Often, the challenge is choosing the most satisfactory solution from a pool of dozens of candidates.

Notice that I didn't say best solution. There is seldom a clearly best way to accomplish something, and good arguments can usually be made for two or three candidates.

Take the task at hand: extracting the plain text from an HTML document.

If you're ambitious, you might be tempted to write your own parser. The problem is that, even if writing parsers is a familiar part of your skill set, you'll have a nagging hunch that you're reinventing the wheel.
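
To make that temptation concrete, here is about the smallest "parser" one might start with. It is only a sketch: StripTagsNaive and the demo program name are hypothetical and are not part of the attached project. The comments list what it gets wrong, which hints at how much work a real parser would be.

```delphi
program StripTagsDemo;  // hypothetical demo, not part of the attached project
{$APPTYPE CONSOLE}

// Naive tag stripper: drops every character between '<' and '>'.
// Known gaps: entities such as &amp; stay encoded, the contents of
// <script> and <style> leak into the output, comments are not
// understood, and a stray '>' in ordinary text is silently dropped.
function StripTagsNaive(const Html: string): string;
var
  I: Integer;
  InTag: Boolean;
begin
  Result := '';
  InTag := False;
  for I := 1 to Length(Html) do
  begin
    if Html[I] = '<' then
      InTag := True          // entering a tag: stop copying
    else if Html[I] = '>' then
      InTag := False         // leaving a tag: resume copying
    else if not InTag then
      Result := Result + Html[I];
  end;
end;

begin
  // prints: Fish &amp; chips
  WriteLn(StripTagsNaive('<p>Fish &amp; chips</p>'));
end.
```

Note that the entity comes through undecoded. Handling that, plus scripts, styles, and malformed markup, is where the wheel-reinventing begins.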

You could go shopping for a code library or component to handle the task. That's certainly a valid approach, and one I tried. The problem with this approach is that you're likely to find several candidates, and that means time spent not only evaluating each one, but also learning to use the one you select.

After a couple of unsatisfactory hours pursuing the above approach, I thought, "Hey, Delphi has a TWebBrowser component, and, using it, I should be able to get at the plain text. Maybe."

So I went to my favorite web site for getting answers to development issues, http://tamaracka.com. It is hosted by the fine people who make the Rubicon text indexing add-on for databases and, naturally, is powered by Rubicon. They periodically archive all the posts from the Borland, Microsoft, and third-party library vendors' newsgroups.

I typed "TWebBrowser.Document" into their Borland search field and, within moments, found exactly what I was looking for: Source to a function that uses TWebBrowser (and some neat tricks) to return the plain text from a string of HTML.

I copied-n-pasted it into a simple "Proof of Concept" Delphi application to check it out, and, by golly, it worked beautifully.

It was originally posted to the borland.public.delphi.thirdpartytools.general newsgroup by somebody named Craig. Thanks a million, Craig! Your name and fine work live on in the Delphi demo project attached to this article.

Here's Craig's function:

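function HtmlToText(const _html: string): string;
var
  WebBrowser: TWebBrowser;
  Document: IHtmlDocument2;
  Doc: OleVariant;
  v: Variant;
  Body: IHTMLBodyElement;
  TextRange: IHTMLTxtRange;
begin
  Result := '';
  WebBrowser := TWebBrowser.Create(nil);
  try
    // Load a blank page so the browser has a document to write into.
    Doc := 'about:blank';
    WebBrowser.Navigate2(Doc);
    Document := WebBrowser.Document as IHtmlDocument2;
    if (Assigned(Document)) then
    begin
      // IHTMLDocument2.write expects a SAFEARRAY of variants.
      v := VarArrayCreate([0, 0], varVariant);
      v[0] := _html;
      Document.write(PSafeArray(TVarData(v).VArray));
      Document.close;
      // A text range over the body exposes the rendered plain text.
      Body := Document.body as IHTMLBodyElement;
      TextRange := Body.createTextRange;
      Result := TextRange.text;
    end;
  finally
    WebBrowser.Free;
  end;
end;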

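Note: You'll have to "use" these units for the function to work: SHDocVw, MSHTML, and ActiveX. In Delphi 6 and later, VarArrayCreate also needs the Variants unit, which new form units include by default.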

The attached demo project is written in Delphi 7, but should work with any version of Delphi that includes TWebBrowser.
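
If you'd like to try the function without downloading the demo, a call from a form might look roughly like this. It is a fragment rather than a complete unit, and everything other than HtmlToText is an assumption for illustration: the form class TForm1, the memos HtmlMemo and PlainTextMemo, and the button handler are hypothetical names, not taken from the attached project.

```delphi
// Fragment for the implementation section of a form unit.
// Assumes HtmlToText is declared earlier in the same unit and that the
// form (hypothetical TForm1) owns two TMemo controls and a TButton.
uses
  SHDocVw, MSHTML, ActiveX, Variants;

procedure TForm1.ConvertButtonClick(Sender: TObject);
begin
  // HtmlMemo holds the raw HTML; PlainTextMemo receives the extracted text.
  PlainTextMemo.Lines.Text := HtmlToText(HtmlMemo.Lines.Text);
end;
```

For the full-text indexing case mentioned earlier, the same call would presumably sit wherever the HTML is about to be saved, with the returned string going into the indexed column instead of a memo.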



Comment 1 of 2

It works! Thank you so much!!

Posted 6 years ago

Comment 2 of 2

Google has made it extremely easy for programmers to extract the Plain Text from the Hypertext Markup Language. However, after conversation with custom essay writer I gotta tell you this mechanism only works for the databases that are supporting full-text language.

Posted 60 days ago
Download Contributed By Wes Peterson:

Wes Peterson is a Senior Programmer Analyst with Prestwood IT Solutions where he develops custom Windows software and custom websites using .Net and Delphi. When Wes is not coding for clients, he participates in this online community. Prior to his 10-year love-affair with Delphi, he worked with several other tools and databases. Currently he specializes in VS.Net using C# and VB.Net. To Wes, the .NET revolution is as exciting as the birth of Delphi.


©1995-2020 PrestwoodBoards  [Security & Privacy]