Vector Space Models and Semantic Analysis

Dr Simon Musgrave1, Dr Alice Gaby1, Mr Gede Primahadi Wijaya Rajeg1

1Monash University, Melbourne, Australia



Distributional semantic analysis is based on the idea that words which occur in the same contexts tend to have similar meanings, an idea encapsulated by J. R. Firth: “a word is characterized by the company it keeps” [1, p. 57]. One way of implementing such an approach is to use vector space models [2], [3] of word meaning. Such models represent a text as a matrix which locates each word in a multi-dimensional space: words used in similar contexts are close to each other in the spatial model, while words which rarely co-occur in the text are far apart. Given a sufficiently large text sample, a model can be constructed which approximates the Saussurean ideal of showing the differences between every lexical element of a language. Implementations of algorithms to produce such models are now readily available [4], [5]. In this paper, we present initial results of semantic analysis using vector space models. The case study examines 22 verbs used to describe events of cutting and breaking, as identified by [6], [7].
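As a minimal sketch of the distributional idea (not the method used in this study, which trains a dense 20-dimensional model), word vectors can be built as rows of a count-based co-occurrence matrix; the toy corpus, window size and example words below are invented for illustration:

```python
import numpy as np

# Invented toy corpus; a real model would use hundreds of millions of words.
corpus = [
    "she cut the bread with a knife",
    "he sliced the bread with a knife",
    "the branch broke in the storm",
    "the twig snapped in the wind",
]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a symmetric window of 2 words.
window = 2
M = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                M[index[w], index[sent[j]]] += 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Words sharing contexts ("cut"/"sliced" both occur near "bread" and "knife")
# end up closer in the space than words with few shared neighbours.
print(cosine(M[index["cut"]], M[index["sliced"]]))
print(cosine(M[index["cut"]], M[index["snapped"]]))
```

Predictive models such as word2vec [4] learn comparable (but denser, lower-dimensional) representations from the same co-occurrence signal.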


A 20-dimensional model was built using the entire contents (more than 500 million words) of the Corpus of Contemporary American English [8]. A vector matrix for the 22 cut/break verbs was extracted from the model. This matrix was then the basis for a hierarchical clustering analysis, resulting in the dendrogram in Figure 1, which shows seven clusters as the most parsimonious grouping of the data.

Figure 1 – Hierarchical cluster analysis of 22 verbs of cutting and breaking
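The clustering step can be sketched as follows, assuming SciPy's hierarchical clustering routines; the vectors here are random stand-ins for the rows actually extracted from the COCA-trained model, and the verb list is truncated for brevity:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 20-dimensional vectors for a few of the verbs; in the study
# these would be the rows extracted from the trained model.
rng = np.random.default_rng(0)
verbs = ["cut", "break", "slash", "hack", "hew", "cleave"]
vectors = rng.normal(size=(len(verbs), 20))

# Average-linkage clustering on cosine distances between word vectors;
# a dendrogram is drawn from Z with scipy.cluster.hierarchy.dendrogram.
Z = linkage(vectors, method="average", metric="cosine")

# Cut the tree into a fixed number of flat clusters (the paper settles on
# seven as the most parsimonious grouping of the full 22-verb set).
labels = fcluster(Z, t=3, criterion="maxclust")
print(dict(zip(verbs, labels)))
```

With random stand-in vectors the resulting groups are of course arbitrary; only with vectors from a trained model do the clusters track semantic similarity.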


We suggest that the dendrogram shows two aspects of the value of these methods in semantic analysis.

Firstly, the clustering reflects semantic intuitions in most cases. For example, the first split in the clustering contrasts words which can be viewed as more basic, such as cut and break themselves, with more specific words, such as slash and hack, which are hyponyms of the first group. As another example, at a lower level of the clustering, the somewhat archaic words hew and cleave group together. However, there are anomalies in the clustering: for example, saw falls in the first main group discussed, and scythe does not group with hew and cleave.

Secondly, the lowest level of clustering shows us which words are closest to each other in the model, allowing us to ask what conceptual differences are relevant in distinguishing those words. An interesting example is the group slice, peel and chop. Intuition might suggest that slice and chop would be close to each other, with peel denoting a rather different type of cutting. In the model, however, peel and chop are closest, with slice grouping together with them at the next level of the hierarchy.

The anomalies in these results suggest that the next step in applying these methods is to use them in association with collocational analysis. Because the vector space model is built from the co-occurrence of words, a phenomenon such as the relation seen here between peel and chop may be based on a commonality in the entities the activity is applied to, rather than on intrinsic properties of the activity. This suggestion is confirmed by Figure 2, which presents a network analysis of the co-occurrence patterns of the 20 words closest to each of the target verbs. The cluster in the upper right of this figure consists of words all used in recipes, suggesting that this genre may be over-represented in the data source.
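A network analysis of this kind can be sketched with NetworkX; the neighbour lists below are invented stand-ins for the 20 nearest words per verb that a real similarity query against the model would return:

```python
import networkx as nx

# Hypothetical nearest-neighbour lists; in the study these come from the
# 20 words closest to each target verb in the vector space model.
neighbours = {
    "chop": ["onion", "garlic", "parsley"],
    "slice": ["onion", "bread", "thinly"],
    "peel": ["garlic", "potato", "onion"],
    "snap": ["twig", "branch", "shut"],
}

# Link each verb to its nearest context words.
G = nx.Graph()
for verb, words in neighbours.items():
    for w in words:
        G.add_edge(verb, w)

# Shared neighbours pull verbs into the same connected region of the
# network, the pattern behind the recipe cluster in Figure 2: here the
# cooking verbs join one component and "snap" sits in another.
components = list(nx.connected_components(G))
print(len(components))  # → 2
```

In practice, community-detection methods (rather than bare connected components) would be used on a weighted co-occurrence graph, but the principle is the same.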


The studies from which we drew inspiration [6], [7] make comparisons across languages, and we are extending our research in this direction, initially to include data from Dutch, German and Swedish (as in [7]).


[1]           J. R. Firth, “A synopsis of linguistic theory 1930-1955,” in Selected Papers of J.R. Firth 1952-1959, F. R. Palmer, Ed. London: Longman, 1968, pp. 168–205.

[2]           S. Clark, “Vector Space Models of Lexical Meaning,” in The Handbook of Contemporary Semantic Theory, Second Edition, S. Lappin and C. Fox, Eds. Hoboken: John Wiley & Sons, 2015, pp. 493–522.

[3]           P. D. Turney and P. Pantel, “From frequency to meaning: Vector space models of semantics,” J. Artif. Intell. Res., vol. 37, no. 1, pp. 141–188, 2010.

[4]           T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.

[5]           R. Řehůřek, “Scalability of Semantic Analysis in Natural Language Processing,” PhD thesis, Masaryk University, 2011.

[6]           A. Majid, J. S. Boster, and M. Bowerman, “The cross-linguistic categorization of everyday events: A study of cutting and breaking,” Cognition, vol. 109, no. 2, pp. 235–250, 2008.

[7]           A. Majid, M. Gullberg, M. van Staden, and M. Bowerman, “How similar are semantic categories in closely related languages? A comparison of cutting and breaking in four Germanic languages,” Cogn. Linguist., vol. 18, no. 2, Jan. 2007.

[8]           M. Davies, “The Corpus of Contemporary American English: 520 million words, 1990-present.” 2008.



Simon Musgrave is a lecturer in linguistics at Monash University who locates much of his work in recent years in the field of Digital Humanities. This continues a longstanding interest in the use of computational tools for linguistic research.  Simon is a member of the executive of the Australasian Association for Digital Humanities and of the management committee of the Australian National Corpus.
