nlp - How to set delimiters for PTB tokenizer?
I'm using the Stanford CoreNLP library for my project. It uses the PTB tokenizer for tokenization. For a statement that goes like this:
go to room no. #2145
or
go to room no. *2145
the tokenizer splits #2145 into 2 tokens: # and 2145. Is there any way to set the tokenizer so that it doesn't treat # and * as delimiters?
A quick solution is to use this option:
(command-line) -tokenize.whitespace
(in Java code) props.setProperty("tokenize.whitespace", "true");
This will cause the tokenizer to tokenize only on whitespace. Do you need it to do anything other than tokenize on whitespace?
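To make the effect concrete, here is a minimal sketch in plain Java of what whitespace-only splitting does to the two example sentences. It does not call CoreNLP. The class `WhitespaceTokenizeSketch` and its `tokenize` helper are hypothetical names made up for illustration, and the sketch only approximates what `tokenize.whitespace` would produce.

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceTokenizeSketch {

    // Hypothetical helper (not part of CoreNLP): splits on runs of
    // whitespace only, so characters like '#' and '*' stay attached
    // to the text that follows them.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("go to room no. #2145")); // [go, to, room, no., #2145]
        System.out.println(tokenize("go to room no. *2145")); // [go, to, room, no., *2145]
    }
}
```

The trade-off is that nothing else gets split either: trailing punctuation stays glued to the word before it, so "#2145." at the end of a sentence would come out as a single token. If that matters for your input, whitespace tokenization alone may not be enough.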