How to strip text bodies from HTML pages?

hakimfullmetal

11 Years of Service

Joined: 17th Feb 2015

Posted: 19th Dec 2016 23:39 Edited at: 19th Dec 2016 23:40

Hello guys.
Let me just say that I'm a zero at HTML.

I want to salvage text from various websites and display it in my game.
So I've downloaded several webpages.

But it seems that every webpage uses different tags to enclose the text body.
For example, descriptive text on Wikipedia is wrapped in different tags than on www.thegamecreators.com.

Quote: "<meta name="twitter:description" content="Nenesha was Infel's friend and the 14th Maiden of Homura/Fuero. Four hundread years ago, she wanted to create Metafalica with Infel but their plan failed. Infel opened her Heart to everyone but..." />"

Quote: "<div class="description">New to AppGameKit? Fear not for there are people on-hand to help you out in here.</div>"

Is there a universal set of tags that I can use as markers, so I can strip out the text in between them?
Just the descriptive text, or titles, that people can read.
I can't work out the common patterns/tags that wrap the text, so I can't use them as markers to strip the text inside them.

The only one that I know of looks like this:

Quote: "<p>
You can use all the well known tags and CSS properties to format text, fonts and
colors.
</p>"

Are there any common tags that can help me identify text bodies for stripping? Or rules that may simplify this?
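For the two quoted snippets above, one possible approach is sketched below. It is an illustration only, written in Python with the standard `html.parser` module rather than in DBPro, and the names `SummaryParser` and `summarize` are made up for this example. It assumes the page carries a `<title>` and a `<meta>` tag whose `name` or `property` ends in "description", which many pages do but none are required to.

```python
from html.parser import HTMLParser


class SummaryParser(HTMLParser):
    """Collects the <title> text and the content of description-style <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.descriptions = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Accept name="description", name="twitter:description",
            # and property="og:description" (an assumption about the page).
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            if key.endswith("description") and attrs.get("content"):
                self.descriptions.append(attrs["content"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html_text):
    """Return (title, list of description strings) for one HTML document."""
    parser = SummaryParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.descriptions


if __name__ == "__main__":
    page = (
        "<html><head><title>Nenesha</title>"
        '<meta name="twitter:description" content="Nenesha was a friend." />'
        "</head><body><p>Body text</p></body></html>"
    )
    print(summarize(page))
```

Pages without these tags return an empty title or an empty list, so a fallback to the body text is still needed.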

Chris Tate

DBPro Master

17 Years of Service

Joined: 29th Aug 2008

Location: London, England

Posted: 20th Dec 2016 01:46

There are so many HTML tags and scripts out there that it would be very tedious to write parsers to interpret them all.

For example, <p>, as you indicated, is a paragraph.
<h1> is heading 1, <h2> is heading 2
<b> is bold text (the old way)
<strong> marks strong importance (usually rendered bold)
<td> is a table cell
<span> is an inline span of text (usually)
<div> is a division, a generic block container

All of these can contain text, but they are not the only elements that can, and some text-based elements are nested inside non-text-based elements, so this is just the tip of the iceberg.

If you could use a WebBrowser control, a simple call to Control.InnerText would return a string containing all the text on the page; in an ideal world you would not need to write the HTML parser yourself.

If you for some reason wanted to interpret all the elements in HTML (bypassing non-text elements), then here is a list of their descriptions: http://www.w3schools.com/tags/default.asp
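To make the "let an existing parser do the work" idea concrete, here is a rough sketch in Python (not DBPro, and not the WebBrowser control itself) using the standard library's `html.parser`. The class name `TextExtractor` and the function `visible_text` are invented for this example. It keeps the text found inside any element and ignores whatever sits inside `<script>` and `<style>`.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gathers text from every element except <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse all runs of whitespace.
        # Note: this can leave a stray space where an inline tag such as
        # <b> sat in the middle of a sentence.
        return " ".join(" ".join(self._chunks).split())


def visible_text(html_text):
    """Return the readable text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{color:red}</style></head>"
        "<body><h1>Title</h1><p>New to AppGameKit? Fear not.</p></body></html>"
    )
    print(visible_text(sample))
```

This does not tell titles apart from navigation menus or footers; it only answers "which characters are text rather than markup".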

hakimfullmetal

Posted: 20th Dec 2016 02:19

Quote: "If you could use a WebBrowser control, a simple call to Control.InnerText would return string containing all the text in the website; in an ideal world you would not need to write the HTML parser yourself."

Err how do I use the WebBrowser control?
External DLL?

WickedX used to mention some DLL used by Internet Explorer. Is that it?

Quote: "urlmon.dll
browseui.dll
ieframe.dll
iertutil.dll
mshtml.dll
shdocvw.dll
urlmon.dll
wininet.dll"

Ortu

DBPro Master

18 Years of Service

Joined: 21st Nov 2007

Location: Austin, TX

Posted: 20th Dec 2016 03:47

Parsing arbitrary and unknown HTML is frankly a nightmare, and automating it in high volume is often against a site's terms of use.

Ideally you want to use a site or service that exposes an API that will return the data you are after directly in either JSON or XML. These are consumable formats; HTML is a display format.

Of course this isn't always possible, but what you are trying to do will always be unreliable at best.
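To show why a data format is easier to consume, here is a tiny sketch in Python (not DBPro). The payload and its field names `title` and `description` are hypothetical; a real service defines its own.

```python
import json

# Hypothetical API response; real services use their own field names.
payload = '{"title": "Nenesha", "description": "Nenesha was a friend of Infel."}'


def read_record(json_text):
    """Return (title, description) from a JSON object with those two keys."""
    record = json.loads(json_text)
    return record["title"], record["description"]


if __name__ == "__main__":
    print(read_record(payload))
```

There is no guessing about which tag wraps the text: the field name says what each value is.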


Kevin Picone

23 Years of Service

Joined: 27th Aug 2002

Location: Australia

Posted: 20th Dec 2016 10:44 Edited at: 20th Dec 2016 10:46

Here's an example I wrote a while back.
Strip Html From String

hakimfullmetal

Posted: 20th Dec 2016 12:26 Edited at: 21st Dec 2016 00:44

Thank you Ortu. It's good to know that before I go on a wild goose chase.
I guess I'll have to really get into studying HTML if I wanted to present the texts in more elegant ways.

Kevin Picone, that code would be a lifesaver.
I tried to search for a PlayBASIC command listing, but the only ones I can find come without parameters, so I don't completely understand what the code does.
Do you have access to PlayBasic command listing with their parameters?
Or is there any other way besides installing PlayBasic?

EDIT: Nvm I was being lazy I guess. I downloaded PlayBasic.

hakimfullmetal

Posted: 21st Dec 2016 00:40 Edited at: 21st Dec 2016 03:02

Here's the HTML stripper originally made by Kevin Picone in PlayBASIC, edited into a DBPro-friendly format.
It also requires IanM's Matrix1Utils plugin.
Just dumping it here in case anybody wants it later.

Rem Project: TEST HTML Stripper
Rem Created: Wednesday, December 21, 2016

Rem ***** Main Source File *****

 `   ----------------------------------------------------------------------------
 `   ----------------------------------------------------------------------------
 `   --{ STRIP HTML FROM STRING }-------------------------------------------
 `   ----------------------------------------------------------------------------
 `   ----------------------------------------------------------------------------
   REMSTART
      This function skims the input string and strips anything that looks like
      a html TAG.  The code only supports a few paired tags, so if you need more
      add them to the select statement in the middle.  
   REMEND   
   
   Html$ = ""
   HtmlNew$ = ""
   
   open to read 1, "webpage.txt"   
   
   WHILE FILE END(1) = 0
   `IF FILE END(1) = 0
    read string 1, HtmlNew$
    Html$ = Html$ +  HtmlNew$ 
   `ENDIF
   ENDWHILE
   
   CleanText$=Strip_Html_From_String(Html$)
   
   sync on: sync rate 60
        desktopwidth=desktop width()
        desktopheight=desktop height()     
        set display mode desktopwidth, desktopheight,32,1    
        Set window position -2,-18
 
    backdrop on
    color backdrop 0 
    disable escapekey
    
`#######################################################################################################################################################################################
`#######################################################################################################################################################################################
`#######################################################################################################################################################################################   
    
   
DO
set cursor 0, 0
PRINT CleanText$ 
SYNC    
LOOP

`#######################################################################################################################################################################################
`#######################################################################################################################################################################################
`#######################################################################################################################################################################################
   
   
      
Function Strip_Html_From_String(Html$)

   HtmlSize=Len(Html$)
   
   TextOutput$=""

   for lp=1 to HtmlSize
   
      `ThisChr=mid(Html$,lp)
      ThisChr=mid ASCII(Html$,lp)
      `ThisChrSTR$ = MID$(Html$,lp)
      `ThisChr=ASC(ThisChrSTR$)
      
      
      if Thischr=asc("<")

            if Lp+2 <= HtmlSize

               `NextChr=mid(Html$,lp+1)
               NextChr=mid ASCII(Html$,lp+1)
               `NextChrSTR$ = MID$(Html$,lp+1)
               `NextChr=ASC(NextChrSTR$)

               if NextChr=asc("?")
                     // Detect a processing instruction such as <?xml ... ?>; if found, find its closing ?> and skip it
                     closetag=INSTR(html$,"?>",lp)                     
                     if closeTag>lp
                                 Lp=closetag+2
                                 `continue
                                 `EXITFUNCTION
                     endif
               endif

               // check if Next char is the closing tag (assuming it's tight against the less than chr
               if NextChr=asc("/")
                     `NextChr=mid(Html$,lp+2)
                     NextChr=mid ASCII(Html$,lp+2)
                     `NextChrSTR$ = MID$(Html$,lp+2)
                     `NextChr=ASC(NextChrSTR$)
               endif

               // ----------------------------------
               // Does the tag start with "!" (a DOCTYPE or a comment)?
               // ----------------------------------
               if NextChr=asc("!")
                     `NextChr=mid(Html$,lp+2)
                     NextChr=mid ASCII(Html$,lp+2)
                     `NextChrSTR$ = MID$(Html$,lp+2)
                     `NextChr=ASC(NextChrSTR$)
                      
                     //this might be a comment
                     `if mid(Html$,lp+2)=asc("-")
                     if mid ASCII(Html$,lp+2)=asc("-")
                        `if mid(Html$,lp+3)=asc("-")
                        if mid ASCII(Html$,lp+3)=asc("-")
                              
                              // DETECT COMMENT TAG, if so, find closing and skip completely
                              `closetag=instring(html$,"-->",lp)
                              closetag=INSTR(html$,"-->",lp)
                              if closeTag>lp
                                       Lp=closetag+2
                                       `continue
                                       `EXITFUNCTION
                              endif
                        endif
                     endif

               endif


               if (NextChr=>asc("a") and NextChr<=asc("z"))  or (NextChr=>asc("A") and NextChr<=asc("Z"))
            
                  // look            
                  `CloseTag=instring(Html$,">",lp+1)
                  CloseTag=INSTR(Html$,">",lp+1)
                  if CloseTag>lp
                  
                     WhiteSpaceFound= 0

                     //find the first white charcter after the alphabet chr, might be 
                     for SearchLP=lp+1 to CloseTag
                           `FindChr=mid(html$,Searchlp)
                           FindChr=mid ASCII(html$,Searchlp)
                           if Findchr=32 or  findchr=9
                                 WhiteSpaceFound =SearchLP
                                 `exitfor SearchLP    
                                 SearchLP = CloseTag           
                           endif 
                     next

                     if WhiteSpaceFound>0
                           // looks like tag
                           Tag$=MID$(Html$,lp+1,WhiteSpaceFound-lp) 
                     else
                           Tag$=MID$(Html$,lp+1,CloseTag-(lp+1)) 
                     endif
                     

                     // --------------------------------------------------------                     
                     // TRAP TAGS and parse out any properties you might want
                     // --------------------------------------------------------      
                     
                     FindClosingTag=0
                         
                                    
                     tag$=trim$(Tag$)
                     `tag$=REMOVE ALL$(Tag$)
                                          
                                    
                     select upper$(tag$)
                     // --------------------------------------------------------                     
                     
                           // --------------------------------------------------------                     
                           case "IMG"
                           // --------------------------------------------------------                     
                                 // grab the coplete tag,
                                 `FullTag$=mid$(Html$,lp+1,CloseTag-(lp+1)) 
                                 FullTag$=MID$(Html$,lp+1,CloseTag-(lp+1)) 

                                 // pull out alternate text if you need it here
                                 AltString$=GetProperty(FullTag$,"alt")
                                 
                                 `TextOutput$+=" "+AltString$+" "
                                 TextOutput$ =TextOutput$ + " "+AltString$+" "
                                 
                              //   Filename$=GetProperty(FullTag$,"src")
                              //   filename$=getfilename$(Filename$)
                              //   TextOutput$+=" "+filename$+" "
                              ENDCASE   


                           // --------------------------------------------------------                     
                           case "SCRIPT","STYLE"
                           // --------------------------------------------------------                     
                                 // Handle PAIRED TAGS, so we're assuming everything between
                                 // this tags closing statement is junk and be removed   
                                    `FindClosingTag= true
                                    FindClosingTag= 1
                              ENDCASE
                     EndSelect


                  
                     lp=closeTag

                     `if FindClosingTag=true
                     if FindClosingTag = 1
                              // assuming closing tag is in the same form  it might be in < /tag>
                              endtag$="<"+"/"+tag$+">"
                              
                              `endtagpos=instring(html$,Endtag$,closetag+1)
                              endtagpos=INSTR(html$,Endtag$,closetag+1)
                              if EndTagPos>CloseTag
                                    lp=EndTagPos+len(endtag$)-1
                              endif
                     endif
   

                  else
                     // no closing > found so just output this as a char
                     goto OutputCHR
                  
                  endif
            
               else
                     // this seems to be a stand alone < char and not a tag
                     goto OutputCHR
               endif
            
            else
                     goto OutputCHR
            // are we more than 2 chrs from end ?
            endif

      else

OutputCHR:            
            // drop this charcter to the output string
            `TextOutput$+=Chr$(ThisChr)   
            `TextOutput$=Chr$(ThisChr)  
            TextOutput$=TextOutput$ + Chr$(ThisChr)   

      // endof of < check
      endif


Done:               
   next


   // brute force replace common character set encodings
   TextOutput$=REPLACE ALL$(TextOutput$,"&"+"nbsp;"," ")
   TextOutput$=REPLACE ALL$(TextOutput$,"&"+"lt;","<")
   TextOutput$=REPLACE ALL$(TextOutput$,"&"+"gt;",">")
   TextOutput$=REPLACE ALL$(TextOutput$,chr$(13)+chr$(10),"")
   TextOutput$=REPLACE ALL$(TextOutput$,chr$(10),"")
   
   Textoutput$=Single_Space_String(TextOutput$)

   // search for rip doubl


EndFunction TextOutput$

`#######################################################################################################################################################################################
FUNCTION GetProperty(Tag$,Property$)

      StartTag$=Property$+"="+chr$(34)
      `Startpos=instring(tag$,StartTag$)
      Startpos=INSTR(tag$,StartTag$)
      
      if Startpos
         ` Move past the opening quote of the property, keeping the position INSTR found
         StartPos=Startpos+len(StartTag$)
         `Endpos   =instring(tag$,chr$(34),startpos)
         Endpos   =INSTR(tag$,chr$(34),startpos)
         if EndPos>StartPOs
               Result$=MID$(Tag$,StartPOs,EndPos-StartPOs)               
               goto done   
               ELSE
               `result$=""               
         endif
      endif
      
      result$=""
         
Done:
ENDFUNCTION Result$
`#######################################################################################################################################################################################
FUNCTION Single_Space_String(S$)

   result$=""
   Size=Len(s$)
   for lp=1 to size
         `Thischr=mid(s$,lp)
         Thischr=mid ASCII(s$,lp)
         if ThisChr=32 or ThisChr=9

               `NExtchr=mid(s$,lp+1)
               NExtchr=mid ASCII(s$,lp+1)
               if NextChr=32 or NextChr=9
                  for skiplp=lp+1 to Size
                        `NExtchr=mid(s$,SkipLP)
                        NExtchr=mid ASCII(s$,SkipLP)
                        if NextChr=32 or NextChr=9
                           lp=SkipLP      
                        else
                           `exitfor skiplp
                           skiplp=Size
                        endif
                  next

               endif
               
               // output space   
               result$=result$ + chr$(32)   

         else
            result$= result$ + chr$(ThisChr)   
         endif   

   next

ENDFUNCTION result$
`#######################################################################################################################################################################################

Replace "webpage.txt" in the line open to read 1, "webpage.txt" with the name of any HTML file you want to strip, and put that file in your project folder.
You can try it with this file:

Attachments

webpage.txt
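The last step of the stripper (the "brute force replace" of entities, followed by Single_Space_String) handles only &nbsp;, &lt; and &gt;. As a comparison only, here is a small Python sketch (not DBPro) that does the same clean-up with the standard library's `html.unescape`, which knows the full list of named entities. The function name `clean_text` is made up for this example. Unlike the DBPro code, it turns line breaks into a single space instead of deleting them.

```python
import html


def clean_text(raw):
    """Decode HTML entities, then collapse all whitespace runs to one space."""
    decoded = html.unescape(raw)              # &lt; -> <, &amp; -> &, &nbsp; -> U+00A0
    decoded = decoded.replace("\u00a0", " ")  # treat non-breaking spaces as plain spaces
    return " ".join(decoded.split())          # split() also drops tabs, CR and LF


if __name__ == "__main__":
    print(clean_text("Tom&nbsp;&amp;&nbsp;Jerry  &lt;3\r\n\tcartoons"))
```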

Chris Tate

Posted: 21st Dec 2016 02:48

Not bad for 321 lines of code aye!
