Brigham Young University's Spanish corpus is perhaps the most effective on the web, thanks to the wide variety of searches it allows across its 100 million words, which span the 13th to the 20th centuries. Dr. Mark Davies, who holds a PhD in Spanish Linguistics, created it in 2002, and now, five years later, he presents a new version with new features at http://www.corpusdelespanol.org. He kindly agreed to an interview with La Página del Idioma Español about this new version.
Tell us about yourself and your interest in the Spanish language.
I learned Spanish when I was 19, as I served as a representative for my church, working with Hispanic immigrants in Los Angeles, California. I got a BA and MA in Spanish Linguistics from Brigham Young University, and a PhD in Spanish Linguistics from the University of Texas at Austin (1992). I then taught Spanish at Illinois State University for 12 years, and I published many articles on historical Spanish syntax and variation in Modern Spanish syntax. I then returned to BYU to teach general linguistics and corpus linguistics in 2003.
How did the idea of carrying out this project come up?
As part of my research on Spanish syntax, I had been creating and using Spanish corpora since the late 1980s. I was aware of corpora such as CREA and CORDE from the Real Academia Española, which were developed in the late 1990s. Nevertheless, I felt that it would be possible to create more useful corpora, which could be used to study a wide range of linguistic phenomena. In 2001 I applied for and received a large grant from the United States National Endowment for the Humanities (NEH) to create such a corpus: the Corpus del Español.
When did it go online?
I worked on the corpus with help from the grant from the NEH in 2001-02, and the corpus was placed online in late 2002.
How many words does this corpus include?
100 million words – 20 million from the 1200s-1400s, 40 million from the 1500s-1700s, 20 million from the 1800s, and 20 million from the 1900s (1900-1999).
How were the texts selected?
For the 20 million words from the 1900s (Siglo XX), we have a balanced corpus of 25% spoken, 25% fiction, 25% newspapers, and 25% academic/other non-fiction. For earlier periods, there is of course not the same variety of genres/registers, but a balance was sought between fiction and non-fiction.
How many visitors does the corpus receive each month?
It typically receives between 1,500 and 2,500 distinct users each month. In the past year, nearly 20,000 different people have used the corpus, for a total of nearly 300,000 queries.
It would also be interesting to hear about the technical side: software, servers, etc.
The corpus is completely based on a relational database architecture that I have developed, and which has been used for a number of other large corpora (e.g. 100 million words and larger) that I have created (see http://corpus.byu.edu). This architecture allows for fast queries (typically 2-3 seconds or less) of large corpora, and – most importantly – it allows for unlimited linguistic annotation of the corpus. This means that the text can be coded for lemma (e.g. different forms of a given verb), part of speech, synonyms, and other semantic information.
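As a rough illustration of what an annotated, relational corpus might look like, here is a minimal sketch in Python using the standard sqlite3 module. It is not the actual architecture of the Corpus del Español: the table name `tokens`, its columns, the tag labels, and the six-word toy data are all assumptions made for the example. It only shows how storing a lemma and a part of speech next to each word lets one query retrieve every form of a verb.

```python
import sqlite3

# Hypothetical schema (an assumption, not the corpus's real design):
# one row per running word, with annotation stored in extra columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tokens ("
    " pos_in_text INTEGER PRIMARY KEY,"
    " word TEXT, lemma TEXT, pos TEXT)"
)

# Toy data: "nos hace pensar" / "les hicieron salir".
rows = [
    (1, "nos", "nos", "PRON"),
    (2, "hace", "hacer", "VERB"),
    (3, "pensar", "pensar", "INF"),
    (4, "les", "le", "PRON"),
    (5, "hicieron", "hacer", "VERB"),
    (6, "salir", "salir", "INF"),
]
conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?)", rows)

# An index on the annotation column keeps lemma lookups fast as the table grows.
conn.execute("CREATE INDEX idx_lemma ON tokens (lemma)")


def forms_of(lemma):
    """Return each surface form of a lemma with its frequency, sorted by form."""
    cur = conn.execute(
        "SELECT word, COUNT(*) FROM tokens WHERE lemma = ? "
        "GROUP BY word ORDER BY word",
        (lemma,),
    )
    return cur.fetchall()


print(forms_of("hacer"))  # [('hace', 1), ('hicieron', 1)]
```

Further annotation (synonyms, semantic classes) could be added in the same spirit as extra columns or linked tables, without changing the text itself.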
What are the technical/linguistic innovations that you are introducing in the new version?
Since its creation six years ago, the corpus has allowed for a wide range of searches that are not available with any other large corpus of Spanish. These include searches by part of speech, lemma, synonym, customized lists, and frequency in different historical periods and in different genres/registers of Modern Spanish. For example, with one simple query it would be possible to find 1) all verbs that appear for the first time in the 1900s, 2) collocates (nearby words) of a given word (e.g. suave or ruido) that occur more in spoken Spanish than in written Spanish, 3) collocates of a given word (e.g. mujer or valor) that occurred more in the 1800s than in the 1900s, 4) the frequency of all strings for a particular syntactic construction (e.g. hacer + infinitive: les hicieron salir, nos hace pensar), 5) the most frequent synonyms of a given word, or 6) any combination of searches involving part of speech, lemma, synonyms, and frequency, for example any form of any synonym of hacer followed by an infinitive (hicieron pensar, mandó decir, obliga a trabajar) that occurs more in spoken Spanish than in fiction or newspapers. Of course, these are just a few examples of an unlimited number of queries that are possible with the corpus.
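To make the fourth kind of query more concrete (the frequency of all strings matching hacer + infinitive), here is a small self-contained sketch in Python with sqlite3. Everything in it is an assumption for illustration: the `tokens` table, the `register` column, the tag names, and the four toy sentences. The real corpus may organize this quite differently. The idea shown is simply that a construction can be matched by joining each word to the word that follows it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tokens (text_id INTEGER, pos_in_text INTEGER,"
    " word TEXT, lemma TEXT, pos TEXT, register TEXT)"
)

# Toy data (invented for this example): (register, [(word, lemma, tag), ...]).
sentences = [
    ("spoken", [("nos", "nos", "PRON"), ("hace", "hacer", "VERB"),
                ("pensar", "pensar", "INF")]),
    ("spoken", [("les", "le", "PRON"), ("hicieron", "hacer", "VERB"),
                ("salir", "salir", "INF")]),
    ("fiction", [("nos", "nos", "PRON"), ("hace", "hacer", "VERB"),
                 ("pensar", "pensar", "INF")]),
    ("fiction", [("hace", "hacer", "VERB"), ("frío", "frío", "NOUN")]),
]
for text_id, (register, toks) in enumerate(sentences):
    for i, (word, lemma, pos) in enumerate(toks):
        conn.execute(
            "INSERT INTO tokens VALUES (?, ?, ?, ?, ?, ?)",
            (text_id, i, word, lemma, pos, register),
        )


def construction_counts(lemma, next_pos):
    """Count two-word strings: any form of `lemma` followed by a word tagged `next_pos`."""
    sql = """
        SELECT a.register, a.word || ' ' || b.word AS phrase, COUNT(*) AS freq
        FROM tokens AS a
        JOIN tokens AS b
          ON b.text_id = a.text_id
         AND b.pos_in_text = a.pos_in_text + 1
        WHERE a.lemma = ? AND b.pos = ?
        GROUP BY a.register, phrase
        ORDER BY a.register, phrase
    """
    return conn.execute(sql, (lemma, next_pos)).fetchall()


for row in construction_counts("hacer", "INF"):
    print(row)
```

On the toy data this lists "hace pensar" once in fiction and once in spoken, and "hicieron salir" once in spoken; "hace frío" is excluded because the second word is not an infinitive.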
The new version (available in Fall 2007) will allow a number of additional searches as well. These include: 1) collocates (nearby words) up to 10 words away (e.g. all adjectives somewhere near nube(s) or all nouns somewhere near lúgubre(s)), 2) comparisons of these collocates across different registers and time periods (to see the different meanings of words or how meanings have changed over time), 3) bar charts that summarize the frequency of all matching words, phrases, or strings of a particular grammatical construction in each time period and in each register of Modern Spanish (spoken, fiction, newspaper, academic), and 4) one-step word comparisons (e.g. what words occur with pelo but not cabello, or with empezar but not comenzar, etc.). None of the searches mentioned in these two paragraphs is possible with any other corpus of Spanish, including the CREA and CORDE of the Real Academia Española.
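The collocate and word-comparison searches described above can be sketched in a few lines of plain Python. This is only an illustration of the general idea, not how the corpus itself computes collocates: the function names `collocates` and `only_with`, the sample sentence, and the choice to count raw co-occurrences without any statistical weighting are all assumptions.

```python
from collections import Counter


def collocates(tokens, node, span=10):
    """Count the words found within `span` tokens on either side of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts


def only_with(tokens, word_a, word_b, span=10):
    """Return the collocates of `word_a` that never occur near `word_b`."""
    near_a = collocates(tokens, word_a, span)
    near_b = collocates(tokens, word_b, span)
    return sorted(w for w in near_a if w not in near_b)


# Invented sample text, already split into tokens.
text = ("el pelo largo y negro de la niña . "
        "el cabello largo y rubio de la actriz").split()

print(only_with(text, "pelo", "cabello", span=3))  # ['negro']
```

A real comparison would run over millions of words and would normally filter by part of speech (for example, adjectives only) and rank the results by frequency rather than return a plain list.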
Asociación Cultural Antonio de Nebrija - © 1996-2008 - All rights reserved / Editor: Ricardo Soca