HTML parsing in JavaCC


I'm new to JavaCC and have been asked to extend a basic HTML parser written in JavaCC. My question: a <script> element contains lots of characters, such as > and <, that have other meanings elsewhere in the grammar. For instance: if(indexof("?")) > 3. I read about lexical states, but did not understand much.

Could someone tell me how I can have multiple definitions for a single character? An explanation with an example would help me.

Also, is parsing the example above required at all? How can I ignore such details?

Finally, for someone doing basic parsing for a search engine, are tags such as <style> required? If not, how do I ignore such tags in JavaCC?

Suppose you want "<" and ">" to be allowed only in tags and end tags, except between <script> and </script> tags, where they should be allowed anywhere. The following should do it:

    // In the DEFAULT state, "<script>" changes the state to IN_SCRIPT.
    TOKEN : { <SCRIPT_TAG : "<script>" > : IN_SCRIPT }

    // In both states "</script>" is recognized and goes back to the DEFAULT state.
    <*> TOKEN : { <END_SCRIPT_TAG : "</script>" > : DEFAULT }

    // Other tags and end tags are recognized only in the DEFAULT state.
    TOKEN : { <OTHER_TAG : "<" (["a"-"z"])* ("/")? ">" > }
    TOKEN : { <END_OTHER_TAG : "</" (["a"-"z"])* ">" > }

    // Characters not in scripts, tags, or end tags.
    TOKEN : { <CHAR : ~["<",">"] > }

    // Characters in scripts.
    <IN_SCRIPT> TOKEN : { <CHAR_IN_SCRIPT : ~[] > }

    // Characters that are not allowed.
    TOKEN : { <UNEXPECTED_CHARACTER : ~[] > }

You can omit the last rule if you want a TokenMgrError, rather than a ParseException, to be thrown when less-than and greater-than signs appear where they should not.

This answer is intended only as an illustration of the use of lexical states. It is not intended to reflect the rules of HTML regarding less-than and greater-than signs, which I do not know.
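
To make the state switching easier to follow, here is a minimal hand-written sketch in plain Java of what a two-state tokenizer does with "<" and ">". It is not the code JavaCC generates. The class name LexicalStateSketch and the method tokenize are hypothetical names chosen for this illustration, and the sketch leaves out the rules for other tags and end tags to stay short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of lexical states; not JavaCC output.
class LexicalStateSketch {

    // The two lexical states used by the grammar.
    enum State { DEFAULT, IN_SCRIPT }

    // Returns the kind of each token found in the input, in order.
    static List<String> tokenize(String input) {
        List<String> kinds = new ArrayList<>();
        State state = State.DEFAULT;
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith("</script>", i)) {
                // Recognized in every state; always returns to DEFAULT.
                kinds.add("END_SCRIPT_TAG");
                state = State.DEFAULT;
                i += "</script>".length();
            } else if (state == State.IN_SCRIPT) {
                // Inside a script any character is accepted, including < and >.
                kinds.add("CHAR_IN_SCRIPT");
                i++;
            } else if (input.startsWith("<script>", i)) {
                // Only recognized in DEFAULT; switches to IN_SCRIPT.
                kinds.add("SCRIPT_TAG");
                state = State.IN_SCRIPT;
                i += "<script>".length();
            } else {
                // DEFAULT state: a bare < or > is flagged, anything else is text.
                char c = input.charAt(i);
                kinds.add(c == '<' || c == '>' ? "UNEXPECTED_CHARACTER" : "CHAR");
                i++;
            }
        }
        return kinds;
    }
}
```

Calling LexicalStateSketch.tokenize("a<script>1>0</script>>") classifies the ">" inside the script as CHAR_IN_SCRIPT and the trailing ">" outside it as UNEXPECTED_CHARACTER. The same character gets a different definition purely because of the state the tokenizer is in when it reads it.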

