HTML parsing in JavaCC
I'm new to JavaCC and have been asked to extend some basic HTML parsing code written with JavaCC. My question: a <script> element contains lots of characters, such as > and <, that already have other definitions in the grammar. For instance: if(indexof("?")) > 3. I have read about lexical states but did not understand much.
Could you let me know how I can have multiple definitions for a single character? An explanation with an example would help me.
Also, is parsing the example mentioned above even required? How can I ignore such details?
Finally, for a person doing basic parsing for a search engine, are tags such as <style> required? If not, how do I ignore such tags in JavaCC?
Suppose you want "<" and ">" to be allowed only in tags and end tags, except that between <script> and </script> tags they should be allowed anywhere. The following should do it:
    // In the DEFAULT state, "<script>" changes to the IN_SCRIPT state.
    TOKEN : { <SCRIPT_TAG : "<script>" > : IN_SCRIPT }

    // In both states "</script>" is recognized and goes to the DEFAULT state.
    <*> TOKEN : { <END_SCRIPT_TAG : "</script>" > : DEFAULT }

    // Other tags and end tags are recognized only in the DEFAULT state.
    TOKEN : { <OTHER_TAG : "<" (["a"-"z"])* ["/"] ">" > }
    TOKEN : { <END_OTHER_TAG : "</" (["a"-"z"])* ">" > }

    // Characters not in scripts, tags, or end tags.
    TOKEN : { <CHAR : ~["<",">"] > }

    // Characters in scripts.
    <IN_SCRIPT> TOKEN : { <CHAR_IN_SCRIPT : ~[] > }

    // Characters that are not allowed.
    TOKEN : { <UNEXPECTED_CHARACTER : ~[] > }

Omit the last rule if you want a TokenMgrError, rather than a ParseException, to be thrown when less-than and greater-than signs appear where they should not.
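To make the state switching easier to follow, here is a minimal hand-written Java sketch of the same two-state idea. It is not the code JavaCC generates, and the names LexicalStateSketch, tokenize, and matchTag are hypothetical, chosen only for this illustration. It assumes the rules are tried the way the grammar above orders them, with the longest match winning and ties going to the rule declared first.

```java
import java.util.ArrayList;
import java.util.List;

class LexicalStateSketch {
    // The two lexical states used by the grammar above.
    enum State { DEFAULT, IN_SCRIPT }

    // Returns the token kinds found in the input, in order.
    static List<String> tokenize(String input) {
        List<String> kinds = new ArrayList<>();
        State state = State.DEFAULT;
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith("</script>", i)) {
                // Recognized in both states (the <*> rule); back to DEFAULT.
                kinds.add("END_SCRIPT_TAG");
                state = State.DEFAULT;
                i += "</script>".length();
            } else if (state == State.IN_SCRIPT) {
                // Inside a script every character is allowed, including < and >.
                kinds.add("CHAR_IN_SCRIPT");
                i++;
            } else if (input.startsWith("<script>", i)) {
                kinds.add("SCRIPT_TAG");
                state = State.IN_SCRIPT;
                i += "<script>".length();
            } else {
                int end = matchTag(input, i);
                if (end > 0) {
                    // "</>" fits both tag rules; the earlier rule (OTHER_TAG) wins.
                    boolean endTag = input.startsWith("</", i) && end - i > 3;
                    kinds.add(endTag ? "END_OTHER_TAG" : "OTHER_TAG");
                    i = end;
                } else {
                    char c = input.charAt(i);
                    kinds.add(c == '<' || c == '>' ? "UNEXPECTED_CHARACTER" : "CHAR");
                    i++;
                }
            }
        }
        return kinds;
    }

    // Matches "<" [a-z]* ["/"] ">" or "</" [a-z]* ">" starting at 'start'.
    // Returns the index just past the tag, or -1 if there is no tag here.
    static int matchTag(String s, int start) {
        if (s.charAt(start) != '<') return -1;
        int j = start + 1;
        boolean slashFirst = j < s.length() && s.charAt(j) == '/';
        if (slashFirst) j++;
        while (j < s.length() && s.charAt(j) >= 'a' && s.charAt(j) <= 'z') j++;
        if (!slashFirst && j < s.length() && s.charAt(j) == '/') j++;
        return (j < s.length() && s.charAt(j) == '>') ? j + 1 : -1;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<p><script>if(a<b)</script>x>"));
    }
}
```

In this sketch the "<" inside the script body comes out as CHAR_IN_SCRIPT, while the stray ">" after the script comes out as UNEXPECTED_CHARACTER. The same character gets a different definition depending on the current state, which is the point of lexical states.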
This answer is intended only as an illustration of the use of lexical states. It is not intended to reflect the rules of HTML regarding less-than and greater-than signs, which I do not know.