Multiword expressions are mapped to identifiers using finite-state
networks. Each of a plurality of multiword expressions is encoded into a
regular expression. Each regular expression encodes a base form common to
a plurality of derivative forms defined by ones of the multiword
expressions. Each of the plurality of regular expressions is compiled
with factorization into a set of finite-state networks. A union of the
finite-state networks in the set of finite-state networks is performed to
define a multiword finite-state network and a set of subnets. The
multiword finite-state network and the set of subnets are traversed to
identify a path corresponding to one of the plurality of multiword
expressions, wherein only transitions originating from the multiword
finite-state network are accounted for to ascertain a path number
identifying a base form of the one of the plurality of multiword
expressions.
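
The path-numbering idea above can be illustrated with a minimal sketch. It is not the implementation described here, only a toy model under stated assumptions: the main network is a small hand-built table (MAIN), each subnet is simplified to a plain set of surface forms (SUBNETS) rather than a compiled finite-state network, and all names, states, and example expressions are hypothetical. The point it shows is that the path number is computed only from arcs of the main network, so every derivative form that enters through the same subnet arc receives the number of its base form.

```python
from functools import lru_cache

# Hypothetical subnets: each name maps to the surface forms it accepts.
# (Assumption: a real subnet would itself be a finite-state network.)
SUBNETS = {
    "$KICK": {"kick", "kicks", "kicked", "kicking"},
    "$TAKE": {"take", "takes", "took", "taking"},
}

# Hypothetical main network: state -> (is_final, ordered list of arcs).
# An arc label is either a literal token or a "$NAME" subnet reference.
MAIN = {
    0: (False, [("$KICK", 1), ("$TAKE", 4)]),
    1: (False, [("the", 2)]),
    2: (False, [("bucket", 3)]),
    3: (True, []),
    4: (False, [("off", 5), ("place", 5)]),
    5: (True, []),
}

# Base forms indexed by path number (order follows the arc order in MAIN).
BASE_FORMS = ["kick the bucket", "take off", "take place"]


@lru_cache(maxsize=None)
def count_paths(state):
    """Number of accepting paths from `state`, counting main arcs only."""
    is_final, arcs = MAIN[state]
    return int(is_final) + sum(count_paths(target) for _, target in arcs)


def matches(label, token):
    """A subnet arc accepts any of its forms; a literal arc accepts itself."""
    if label in SUBNETS:
        return token in SUBNETS[label]
    return label == token


def path_number(tokens, state=0, pos=0, number=0):
    """Return the path number of `tokens`, or None if no path accepts them."""
    is_final, arcs = MAIN[state]
    if pos == len(tokens):
        return number if is_final else None
    # A path that ends at this state is numbered before any longer path.
    skipped = int(is_final)
    for label, target in arcs:
        if matches(label, tokens[pos]):
            found = path_number(tokens, target, pos + 1, number + skipped)
            if found is not None:
                return found
        # Arcs passed over contribute all of their paths to the number;
        # whatever happens inside a subnet contributes nothing.
        skipped += count_paths(target)
    return None


def base_form(expression):
    """Map a (possibly inflected) multiword expression to its base form."""
    number = path_number(tuple(expression.split()))
    return None if number is None else BASE_FORMS[number]
```

In this sketch, base_form("took off") and base_form("taking off") both follow the "$TAKE" arc and the "off" arc of the main network, so both resolve to path number 1 and hence to "take off". Representing a subnet as a set keeps the example short; it stands in for the factorized subnets produced by compilation.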