HTML parsing in JavaCC

- August 15, 2010

I'm new to JavaCC and have been asked to extend a basic HTML parser written with JavaCC. My question: a <script> element contains lots of characters, such as > and <, that have other definitions. For instance: if (indexOf("?") > 3). I have read about lexical states, but did not understand much.

Could someone let me know how I can have multiple definitions for a single character? An explanation with an example would help me.

And is parsing the above-mentioned example required at all? How can I ignore such details?

Also, for a person doing basic parsing for a search engine, are tags such as <style> required? If not, how do I ignore such tags in JavaCC?

Suppose you want "<" and ">" to be allowed only in tags and end tags, except that between <script> and </script> tags they should be allowed anywhere. The following should do it:

    // In the DEFAULT state, "<script>" switches to the IN_SCRIPT state.
    TOKEN : { <SCRIPT_TAG : "<script>" > : IN_SCRIPT }

    // In both states, "</script>" is recognized and switches back to the DEFAULT state.
    <*> TOKEN : { <END_SCRIPT_TAG : "</script>" > : DEFAULT }

    // Other tags and end tags are recognized only in the DEFAULT state.
    // ("/")? makes the trailing slash optional, so both <b> and <br/> match.
    TOKEN : { <OTHER_TAG : "<" (["a"-"z"])* ("/")? ">" > }
    TOKEN : { <END_OTHER_TAG : "</" (["a"-"z"])* ">" > }

    // Characters that are not in scripts, tags, or end tags.
    TOKEN : { <CHAR : ~["<",">"] > }

    // Characters in scripts.
    <IN_SCRIPT> TOKEN : { <CHAR_IN_SCRIPT : ~[] > }

    // Characters that are not allowed (a stray "<" or ">" in the DEFAULT state).
    TOKEN : { <UNEXPECTED_CHARACTER : ~[] > }

Omit the last rule if you want a TokenMgrError, rather than a ParseException, to be thrown when less-than and greater-than signs appear where they should not.

This answer is intended as an illustration of the use of lexical states. It is not intended to reflect the rules of HTML regarding less-than and greater-than signs, which I do not know.
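To make the state switching concrete, here is a minimal hand-written Java sketch of what a token manager with these two lexical states does. It is not the code JavaCC generates; the class `LexicalStateSketch` and the helpers `tokenize` and `matchTag` are hypothetical names used only for illustration. It assumes the trailing slash in OTHER_TAG is optional, and that ties between rules are broken by declaration order, as JavaCC does for matches of equal length.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generated token manager; for illustration only.
class LexicalStateSketch {
    // The two lexical states used by the grammar.
    enum State { DEFAULT, IN_SCRIPT }

    // Matches: prefix, zero or more lowercase letters, an optional "/" (if allowed), then ">".
    // Returns the index just past the match, or -1 if there is no match at pos.
    static int matchTag(String s, int pos, String prefix, boolean allowSlash) {
        if (!s.startsWith(prefix, pos)) return -1;
        int i = pos + prefix.length();
        while (i < s.length() && s.charAt(i) >= 'a' && s.charAt(i) <= 'z') i++;
        if (allowSlash && i < s.length() && s.charAt(i) == '/') i++;
        return (i < s.length() && s.charAt(i) == '>') ? i + 1 : -1;
    }

    // Returns the kind of each token in order, e.g. "SCRIPT_TAG" or "CHAR_IN_SCRIPT".
    static List<String> tokenize(String input) {
        List<String> kinds = new ArrayList<>();
        State state = State.DEFAULT;
        int pos = 0;
        while (pos < input.length()) {
            int end;
            if (input.startsWith("</script>", pos)) {
                // Recognized in both states; always returns to DEFAULT.
                kinds.add("END_SCRIPT_TAG");
                state = State.DEFAULT;
                pos += "</script>".length();
            } else if (state == State.IN_SCRIPT) {
                // Inside a script every character, including "<" and ">", is plain text.
                kinds.add("CHAR_IN_SCRIPT");
                pos++;
            } else if (input.startsWith("<script>", pos)) {
                kinds.add("SCRIPT_TAG");
                state = State.IN_SCRIPT;
                pos += "<script>".length();
            } else if ((end = matchTag(input, pos, "<", true)) != -1) {
                kinds.add("OTHER_TAG");
                pos = end;
            } else if ((end = matchTag(input, pos, "</", false)) != -1) {
                kinds.add("END_OTHER_TAG");
                pos = end;
            } else {
                // A "<" or ">" that is not part of a tag is unexpected in the DEFAULT state.
                char c = input.charAt(pos);
                kinds.add(c == '<' || c == '>' ? "UNEXPECTED_CHARACTER" : "CHAR");
                pos++;
            }
        }
        return kinds;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("<b>x</b><script>if (a < b) f();</script>"));
    }
}
```

In this sketch the "<" inside the script comes out as CHAR_IN_SCRIPT, while the same character outside a script and outside a tag comes out as UNEXPECTED_CHARACTER. A search-engine indexer built on such a token stream could, for example, simply discard every CHAR_IN_SCRIPT token.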
