HTML parsing in JavaCC
I'm new to JavaCC and have been asked to extend some basic HTML parsing using JavaCC code. My question: a <script> element contains lots of characters, such as > and <, that have other definitions. For instance: if(indexof("?")) > 3. I read about lexical states, but did not understand much.
Could someone let me know how I can have multiple definitions for a single character? An explanation with an example would help me.
And is parsing the above-mentioned example required at all? How do I ignore such details?
Also, for a person doing basic parsing for a search engine, are tags such as <style> required? If not, how do I ignore tags in JavaCC?
Suppose you want "<" and ">" to be allowed only in tags and end tags, except that between <script> and </script> tags they should be allowed anywhere. The following should do it:
    // In the DEFAULT state, "<script>" changes to the IN_SCRIPT state.
    TOKEN : { <SCRIPT_TAG : "<script>" > : IN_SCRIPT }

    // In both states "</script>" is recognized and goes to the DEFAULT state.
    <*> TOKEN : { <END_SCRIPT_TAG : "</script>" > : DEFAULT }

    // Other tags and end tags are recognized only in the DEFAULT state.
    // ("/")? makes the trailing slash optional, so both <br> and <br/> match.
    TOKEN : { <OTHER_TAG : "<" (["a"-"z"])* ("/")? ">" > }
    TOKEN : { <END_OTHER_TAG : "</" (["a"-"z"])* ">" > }

    // Characters not in scripts, tags, or end tags.
    TOKEN : { <CHAR : ~["<",">"] > }

    // Characters in scripts: any character at all, including "<" and ">".
    <IN_SCRIPT> TOKEN : { <CHAR_IN_SCRIPT : ~[] > }

    // Characters that are not allowed (a stray "<" or ">" in the DEFAULT state).
    TOKEN : { <UNEXPECTED_CHARACTER : ~[] > }
You can omit the last rule if you want a TokenMgrError, rather than a ParseException, to be thrown when less-than and greater-than signs appear where they should not.
This answer is intended only as an illustration of the use of lexical states. It is not intended to reflect the rules of HTML regarding less-than and greater-than signs, which I do not know.
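On the part of the question about ignoring <style> content: the same lexical-state idea can probably be combined with SKIP, so that the lexer discards the text instead of producing tokens for it. The following is only a minimal sketch under some assumptions. The state name IN_STYLE is made up for illustration, the tag is assumed to be written exactly as lowercase <style> with no attributes, and the first rule is assumed to be placed before the OTHER_TAG rule in the grammar file, since JavaCC prefers the earlier rule when two matches have the same length.

```
// Hypothetical state name IN_STYLE; it is not part of the grammar above.
// In the DEFAULT state, "<style>" is discarded and the lexer enters IN_STYLE.
SKIP : { "<style>" : IN_STYLE }

// Inside IN_STYLE, "</style>" is discarded and the lexer returns to DEFAULT.
// The longest match wins, so this rule beats the single-character rule below.
<IN_STYLE> SKIP : { "</style>" : DEFAULT }

// Every other character inside the style element is discarded.
<IN_STYLE> SKIP : { < ~[] > }
```

With rules like these the parser never sees the style sheet text at all, which is usually what a search-engine indexer wants. Note that the <*> rule for END_SCRIPT_TAG above applies in every state, so it would also fire inside IN_STYLE; restricting it to <DEFAULT, IN_SCRIPT> would avoid that.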