xml - Extract substrings with XSLT
I am doing an XML -> TSV conversion using XSLT 2.0.
Data
My XML file looks like this:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE TEI.2 SYSTEM "lemmatizzazione.dtd">
<?xml-stylesheet type="text/xsl" href="dh.xsl"?>
<root>
  <l>
    <lm lemma="me" catg="ss">i</lm>
    <lm lemma="would" catg="vv">would</lm>
    <lm lemma="like" catg="vv">like</lm>
    <lm lemma="to" catg="vv">to</lm>
    <lm lemma="to have" catg="vv">have,</lm>
  </l>
  <l>
    <lm lemma="this" catg="ad">this</lm>
    <lm lemma="bad" catg="e">in to</lm>
    <lm lemma="a" catg="e">a</lm>
    <lm lemma="ts" catg="ad">tsv</lm>
    <lm lemma="for" catg="nn">format</lm>,
  </l>
  <l>
    <lm1>
      <lm lemma="but1" catg="x01">but</lm>
      <lm lemma="but2" catg="x02">but</lm>
    </lm1>
    <lm lemma="" catg="">first</lm>
    <lm lemma="" catg="">there are</lm>
    <lm1>
      <lm lemma="rxs" catg="fff">a</lm>
      <lm lemma="tre" catg="ds">a</lm>
      <lm lemma="asq" catg="rt">a</lm>
    </lm1>
    "<lm lemma="few" catg="e">few</lm>
  </l>
  <l>
    <lm>problems</lm>
    <lm>to</lm>
    <lm>solve.</lm>
    <lm>here</lm>
    <lm>and</lm>
  </l>
  <l>
    <lm lemma="there" catg="e">there</lm>.
    <lm lemma="bad" catg="ee">this is</lm>
    <lm lemma="bad2" catg="ee">multi word</lm>
    <lm lemma="good" catg="nn">expression.</lm>
  </l>
</root>
Desired output
The desired output is the following:
token    lemma      catg     l's
i        me         ss       1
would    would      vv       1
like     like       vv       1
to       to         vv       1
have     to have    vv       1
,                            1
...
format   for        nn       2
,                            2
but      but1but2   x01x02   3
My code
My current XSLT is this:
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output method="text" encoding="utf-8"/>

  <xsl:template match="lm">
    <xsl:choose>
      <xsl:when test="contains(.,' ')">
        <xsl:call-template name="tokenize">
          <xsl:with-param name="mwe" select="current()"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:choose>
          <xsl:when test="matches(substring(.,string-length(.)),'\.|,|;')">
            <xsl:call-template name="remove_punctuation">
              <xsl:with-param name="token" select="."/>
              <xsl:with-param name="mark" select="substring(.,string-length(.))"/>
            </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="."/>
            <xsl:call-template name="ids"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="lm1">
    <xsl:value-of select="lm"/>
    <xsl:call-template name="multi_ids"/>
  </xsl:template>

  <xsl:template name="ids">
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="./@lemma"/><xsl:text>&#9;</xsl:text>
    <xsl:value-of select="./@catg"/><xsl:text>&#9;</xsl:text>
    <xsl:value-of select="count(../preceding-sibling::l) + 1"/><xsl:text>&#10;</xsl:text>
  </xsl:template>

  <xsl:template name="multi_ids">
    <xsl:text>&#9;</xsl:text>
    <xsl:for-each select="lm">
      <xsl:value-of select="./@lemma"/>
    </xsl:for-each><xsl:text>&#9;</xsl:text>
    <xsl:for-each select="lm">
      <xsl:value-of select="./@catg"/>
    </xsl:for-each><xsl:text>&#9;</xsl:text>
    <xsl:value-of select="count(../preceding-sibling::l) + 1"/><xsl:text>&#10;</xsl:text>
  </xsl:template>

  <xsl:template name="tokenize">
    <xsl:param name="mwe"/>
    <xsl:for-each select="tokenize($mwe/.,'\s+')">
      <xsl:value-of select="normalize-space(.)"/><xsl:text>&#9;</xsl:text>
      <xsl:value-of select="$mwe/./@lemma"/><xsl:text>&#9;</xsl:text>
      <xsl:value-of select="$mwe/./@catg"/><xsl:text>&#9;</xsl:text>
      <xsl:value-of select="count($mwe/../preceding-sibling::l) + 1"/><xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

  <xsl:template name="remove_punctuation">
    <xsl:param name="token"/>
    <xsl:param name="mark"/>
    <xsl:for-each select="tokenize($token/.,$mark)">
      <xsl:value-of select="normalize-space(.)"/><xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

  <xsl:template match="/root">
    <th>token</th><xsl:text>&#9;</xsl:text>
    <th>lemma</th><xsl:text>&#9;</xsl:text>
    <th>catg</th><xsl:text>&#9;</xsl:text>
    <th>l's</th><xsl:text>&#10;</xsl:text>
    <xsl:for-each select="l">
      <xsl:apply-templates/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
Question
So, I don't know how to do two things:
1) How to separate a word from its punctuation, so that the input line below gives the two rows after it:
<lm lemma="to have" catg="vv">have,</lm>
have     to have    vv       1
,                            1
2) How to "tokenize" text that sits outside an lm node, so that the input line below gives the two rows after it:
<lm lemma="for" catg="nn">format</lm>,
format   for        nn       2
,                            2
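For what it's worth, here is a minimal sketch of one way both points might be handled in XSLT 2.0. It is not a tested drop-in replacement for the stylesheet above. It assumes that only `.`, `,` and `;` count as punctuation (the same set the `matches()` test in the question uses), that punctuation rows carry an empty lemma and catg, and that an `lm1` group should print the text of its first `lm` child as the token. The idea is to let one template handle both `lm` elements and loose text nodes directly under `l`, and to split the string with `xsl:analyze-string` instead of `tokenize()`, so the punctuation mark is kept rather than thrown away.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:strip-space elements="*"/>
  <xsl:output method="text" encoding="utf-8"/>

  <!-- Header row, then every child node of each l (elements AND text nodes)
       in document order. -->
  <xsl:template match="/root">
    <xsl:text>token&#9;lemma&#9;catg&#9;l's&#10;</xsl:text>
    <xsl:apply-templates select="l/node()"/>
  </xsl:template>

  <!-- One template for lm elements and for text sitting directly inside l.
       For a text node the attribute lookups return nothing, so lemma and
       catg come out empty. -->
  <xsl:template match="l/lm | l/text()">
    <xsl:variable name="lemma" select="string(@lemma)"/>
    <xsl:variable name="catg" select="string(@catg)"/>
    <xsl:variable name="line" select="count(../preceding-sibling::l) + 1"/>
    <!-- Assumption: only . , ; are treated as punctuation. -->
    <xsl:analyze-string select="string(.)" regex="[.,;]">
      <xsl:matching-substring>
        <!-- A punctuation mark gets its own row with empty lemma and catg. -->
        <xsl:value-of select="concat(., '&#9;&#9;&#9;', $line, '&#10;')"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Everything else is split on whitespace, one row per word;
             a whitespace-only piece yields no rows at all. -->
        <xsl:for-each select="tokenize(normalize-space(.), ' ')">
          <xsl:value-of select="concat(., '&#9;', $lemma, '&#9;',
                                       $catg, '&#9;', $line, '&#10;')"/>
        </xsl:for-each>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- lm1 groups: one row, attributes of all lm children concatenated.
       Assumption: the first child's text is used as the token. -->
  <xsl:template match="l/lm1">
    <xsl:value-of select="concat(lm[1], '&#9;',
                                 string-join(lm/@lemma, ''), '&#9;',
                                 string-join(lm/@catg, ''), '&#9;',
                                 count(../preceding-sibling::l) + 1, '&#10;')"/>
  </xsl:template>

</xsl:stylesheet>
```

Inside `xsl:analyze-string` the context item is a string, not a node, which is why the lemma, catg and line number are stored in variables before the split. Selecting `l/node()` rather than relying on the default `apply-templates` is what brings the stray `,` after `</lm>` into play: it is an ordinary text node, and with `strip-space` in effect only the whitespace-only text nodes are dropped. One side effect to be aware of is that any other loose text (such as the `"` before `few`) would also produce a row with empty lemma and catg.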