nlp - How to set delimiters for PTB tokenizer?


I'm using the Stanford CoreNLP library for my project. It uses the PTB tokenizer for tokenization. For a statement that goes like this:

go to room no. #2145

or

go to room no. *2145

the tokenizer splits #2145 into 2 tokens: # and 2145. Is there any way to set the tokenizer so that it doesn't treat # and * as delimiters?

A quick solution is to use this option:

(command-line) -tokenize.whitespace
(in Java code) props.setProperty("tokenize.whitespace", "true");

This will cause the tokenizer to split only on whitespace, so #2145 and *2145 each stay a single token. Do you need anything other than tokenizing on whitespace?
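As a rough illustration of what the option above does, here is a minimal sketch in plain Java. It builds the Properties object that would normally be passed to a CoreNLP pipeline (for example via new StanfordCoreNLP(props), which is not called here so the snippet runs without the library on the classpath). The class name WhitespaceTokenizeSketch and the helper whitespaceTokens are hypothetical; the helper is only a stand-in that mimics whitespace-only splitting, not CoreNLP's own tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

class WhitespaceTokenizeSketch {

    // Builds the properties one would hand to a CoreNLP pipeline.
    // The pipeline itself is not constructed in this sketch.
    static Properties buildProps() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize");
        props.setProperty("tokenize.whitespace", "true");
        return props;
    }

    // Hypothetical helper (not part of CoreNLP): a plain-Java stand-in
    // that splits on runs of whitespace and nothing else.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(buildProps().getProperty("tokenize.whitespace"));
        // "#2145" and "*2145" each remain one token under whitespace splitting.
        System.out.println(whitespaceTokens("go to room no. #2145"));
        System.out.println(whitespaceTokens("go to room no. *2145"));
    }
}
```

One consequence of splitting on whitespace only is that punctuation attached to a word, such as a trailing comma or period, also stays attached to that word.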

