xml - Extract substrings with XSLT


I am doing an XML-to-TSV conversion using XSLT 2.0.

Data

My XML file looks like this:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE TEI.2 SYSTEM "lemmatizzazione.dtd">
<?xml-stylesheet type="text/xsl" href="dh.xsl"?>
<root>
  <l>
    <lm lemma="me" catg="ss">i</lm>
    <lm lemma="would" catg="vv">would</lm>
    <lm lemma="like" catg="vv">like</lm>
    <lm lemma="to" catg="vv">to</lm>
    <lm lemma="to have" catg="vv">have,</lm>
  </l>
  <l>
    <lm lemma="this" catg="ad">this</lm>
    <lm lemma="bad" catg="e">in to</lm>
    <lm lemma="a" catg="e">a</lm>
    <lm lemma="ts" catg="ad">tsv</lm>
    <lm lemma="for" catg="nn">format</lm>,
  </l>
  <l>
    <lm1>
      <lm lemma="but1" catg="x01">but</lm>
      <lm lemma="but2" catg="x02">but</lm>
    </lm1>
    <lm lemma="" catg="">first</lm>
    <lm lemma="" catg="">there are</lm>
    <lm1>
      <lm lemma="rxs" catg="fff">a</lm>
      <lm lemma="tre" catg="ds">a</lm>
      <lm lemma="asq" catg="rt">a</lm>
    </lm1>
    "<lm lemma="few" catg="e">few</lm>
  </l>
  <l>
    <lm>problems</lm>
    <lm>to</lm>
    <lm>solve.</lm>
    <lm>here</lm>
    <lm>and</lm>
  </l>
  <l>
    <lm lemma="there" catg="e">there</lm>.
    <lm lemma="bad" catg="ee">this is</lm>
    <lm lemma="bad2" catg="ee">multi word</lm>
    <lm lemma="good" catg="nn">expression.</lm>
  </l>
</root>

Desired output

The desired output is the following (columns are tab-separated):

token    lemma     catg    l's
i        me        ss      1
would    would     vv      1
like     like      vv      1
to       to        vv      1
have     to have   vv      1
,                          1
...
format   for       nn      2
,                          2
but      but1but2  x01x02  3

My code

My current XSLT is this:

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:output method="text" encoding="utf-8"/>

  <xsl:template match="lm">
    <xsl:choose>
      <xsl:when test="contains(.,' ')">
        <xsl:call-template name="tokenize">
          <xsl:with-param name="mwe" select="current()"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:choose>
          <xsl:when test="matches(substring(.,string-length(.)),'\.|,|;')">
            <xsl:call-template name="remove_punctuation">
              <xsl:with-param name="token" select="."/>
              <xsl:with-param name="mark" select="substring(.,string-length(.))"/>
            </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="." />
            <xsl:call-template name="ids"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="lm1">
    <xsl:value-of select="lm" />
    <xsl:call-template name="multi_ids" />
  </xsl:template>

  <xsl:template name="ids">
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="./@lemma"/><xsl:text>&#9;</xsl:text>
    <xsl:value-of select="./@catg"/><xsl:text>&#9;</xsl:text>
    <xsl:value-of select="count(../preceding-sibling::l) + 1"/><xsl:text>&#10;</xsl:text>
  </xsl:template>

  <xsl:template name="multi_ids">
    <xsl:text>&#9;</xsl:text>
    <xsl:for-each select="lm">
      <xsl:value-of select="./@lemma"/>
    </xsl:for-each><xsl:text>&#9;</xsl:text>
    <xsl:for-each select="lm">
      <xsl:value-of select="./@catg"/>
    </xsl:for-each><xsl:text>&#9;</xsl:text>
    <xsl:value-of select="count(../preceding-sibling::l) + 1"/><xsl:text>&#10;</xsl:text>
  </xsl:template>

  <xsl:template name="tokenize">
    <xsl:param name="mwe"/>
    <xsl:for-each select="tokenize($mwe/.,'\s+')">
      <xsl:value-of select="normalize-space(.)"/><xsl:text>&#9;</xsl:text>
      <xsl:value-of select="$mwe/./@lemma"/><xsl:text>&#9;</xsl:text>
      <xsl:value-of select="$mwe/./@catg"/><xsl:text>&#9;</xsl:text>
      <xsl:value-of select="count($mwe/../preceding-sibling::l) + 1"/><xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

  <xsl:template name="remove_punctuation">
    <xsl:param name="token"/>
    <xsl:param name="mark"/>
    <xsl:for-each select="tokenize($token/.,$mark)">
      <xsl:value-of select="normalize-space(.)"/><xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

  <xsl:template match="/root">
    <th>token</th><xsl:text>&#9;</xsl:text>
    <th>lemma</th><xsl:text>&#9;</xsl:text>
    <th>catg</th><xsl:text>&#9;</xsl:text>
    <th>l's</th><xsl:text>&#10;</xsl:text>
    <xsl:for-each select="l">
      <xsl:apply-templates>
      </xsl:apply-templates>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>

Question

So, I don't know how to do two things:

1) How to separate a word from its trailing punctuation, so that each gets its own row:

<lm lemma="to have" catg="vv">have,</lm>

have     to have   vv      1
,                          1

2) How to "tokenize" text that sits outside an lm node (directly inside l):

<lm lemma="for" catg="nn">format</lm>,

format   for       nn      2
,                          2
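
One possible direction for both points, as a minimal sketch rather than a tested drop-in replacement: in XSLT 2.0, xsl:analyze-string can peel trailing punctuation off a word, and a template matching l/text() can catch text that sits directly inside l. The named template "rows" and its parameters are hypothetical names introduced here for illustration. The sketch assumes that only '.', ',' and ';' count as punctuation, and it leaves out the header row and the merging of lm1 groups, which the original lm1 template could still handle.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="utf-8"/>
  <xsl:strip-space elements="*"/>

  <!-- Hypothetical helper: splits $text on whitespace, then splits
       trailing punctuation off each word, writing one TSV row per piece. -->
  <xsl:template name="rows">
    <xsl:param name="text"/>
    <xsl:param name="lemma" select="''"/>
    <xsl:param name="catg" select="''"/>
    <xsl:param name="line"/>
    <xsl:for-each select="tokenize(normalize-space($text), ' ')">
      <!-- Assumption: only '.', ',' and ';' are treated as punctuation. -->
      <xsl:analyze-string select="." regex="[.,;]+$">
        <xsl:matching-substring>
          <!-- Punctuation row: lemma and catg columns stay empty. -->
          <xsl:value-of select="concat(., '&#9;&#9;&#9;', $line, '&#10;')"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <!-- Word row: inherits lemma and catg from the caller. -->
          <xsl:value-of select="concat(., '&#9;', $lemma, '&#9;',
                                       $catg, '&#9;', $line, '&#10;')"/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:for-each>
  </xsl:template>

  <!-- Point 1: words inside lm, with or without trailing punctuation. -->
  <xsl:template match="lm">
    <xsl:call-template name="rows">
      <xsl:with-param name="text" select="."/>
      <xsl:with-param name="lemma" select="@lemma"/>
      <xsl:with-param name="catg" select="@catg"/>
      <xsl:with-param name="line"
          select="count(ancestor::l/preceding-sibling::l) + 1"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Point 2: text directly inside l, outside any lm element. -->
  <xsl:template match="l/text()">
    <xsl:call-template name="rows">
      <xsl:with-param name="text" select="."/>
      <xsl:with-param name="line"
          select="count(../preceding-sibling::l) + 1"/>
    </xsl:call-template>
  </xsl:template>
</xsl:stylesheet>
```

Inside xsl:for-each and xsl:analyze-string the context item is a string, not the lm element, which is why the lemma, catg and line number are passed in as parameters instead of being read with @lemma at that point. Because xsl:strip-space removes whitespace-only text nodes, the l/text() template should only fire for text such as the "," after format.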

