Promoting CLIA workshop at COLING-2010

Add Comment | Jul 20, 2010

Are you or someone you know attending COLING-2010 in Beijing? If so, I would strongly suggest also signing up for the Fourth International Workshop on Cross Lingual Information Access (CLIA), which takes place within the conference.

I’m promoting CLIA because we have a paper at this workshop and we need a few more registered participants to reach the minimum of 20, so that we don’t get cancelled!

However, you won’t be disappointed if you attend. There are some very good quality papers from very good quality authors!

Oh and by the way, the early bird registration deadline for the main conference, workshops and tutorials has been extended to July 31st! So you can still avail of a great discount if you register soon!

Not convinced yet? The conference registration includes a FREE tour of the Great Wall of China! Plus, there are other sight-seeing activities and a banquet on offer at great competitive prices!

Register TODAY for a great conference and a great workshop you will never forget!

Main COLING website: http://www.coling-2010.org/

CLIA website: http://research.microsoft.com/en-us/UM/india/events/CLIA2010/page.htm

(registration to CLIA is done through main COLING website)

Termbase update!

Add Comment | Jan 24, 2010

Happy New Year!

In November I posted about Publicly Accessible Termbases. Since then I have added quite a few suggestions from readers (thanks Licia and Michal). Today I added some more suggestions, so if you haven’t visited that post in a while, now is probably a good time to check it out. Among the latest additions we have UNTERM (The United Nations Termbase) and the FAO Termbase.

Oh and by the way, you can see all of these Termbases in my Delicious account as well! They are all tagged with “termbase”.

Publicly Accessible Termbases

2 Comments | Nov 27, 2009

Again I have been a bit slow posting new articles on the blog. And now that the Academic term is approaching its climax my workload has increased quite a lot in the last month. Hopefully I'll have a bit more time to write in the holiday period.

In spite of my workload, I decided to write a small but (hopefully) important article today about publicly accessible termbases. After the ISO published their own termbase I thought it would be pertinent to compile a list of termbases available online. Well, here are the ones I know of. If you would like to contribute more termbases you can do so by adding a comment to this blog post! :)

Termbase

Description

Microsoft Language Portal

It allows you to search Microsoft's terminology database and software translation memories (UI translations). You can search these resources but you can't download them from this site. For a given term, Microsoft's termbase shows the source term, target term, definition and product. No other data categories are visible. If you have an MSDN Subscription, you can download the software translation memories in CSV format from the MSDN Subscription site. More info here.

Sun Globalisation Portal (Sun Gloss)

It allows you to search and download (export) Sun's terminology. Sun makes public a lot of their data categories in this tool (part of speech, status, product line, and a few others). Not many terms have definitions and it looks like the database mixes terms with UI strings. Note: you have to create an account with Sun. It's very easy, free and gives you instant access.

EuroTermBank

An EU project to provide a centralised terminology resource for new EU member states; however, it contains terms in all EU languages. The database covers terms from a wide range of domains. It consolidates a diverse set of terminology resources in its database. You can get more info in this article. The tool allows you to search terms in any EU language and get their translation in another EU language. The tool displays the term, definition and an explanation note. No other data categories are exposed. They also provide a Microsoft Word add-in that allows you to integrate terminology in your documents as you write them! (I blogged about this in the past).

InterActive Terminology for Europe (IATE)

This is the EU's inter-institutional terminology database. Since it deals primarily with terminology produced by EU institutions (European Commission, Parliament, Economic & Social Committee, etc.) the domains it covers include Law, Politics, Economy, Banking, Education, etc. The IATE web portal allows you to search terms in any EU language and get their translation in another EU language. The termbase fields exposed include the source term, target term, domain, definition, reliability (approval status?), date and a couple of IDs called Term Ref. and Definition Ref.

ISO Concept Database (ISO CDB)

A highly anticipated termbase that was only released to the public in October, the ISO CDB provides terminology from all of its standards. Since ISO's standards cover practically every aspect of human endeavour, you can expect that the ISO CDB's domain coverage will be pretty comprehensive. The database seems to be monolingual only (English) and it exposes some basic fields: term, definition, part of speech and a note, along with other more ISO-specific fields: title, reference (standard the term belongs to), edition, entry no., etc. Some terms might be accompanied by a graphic symbol. I have not seen a facility to export data.

wordnik

Strictly speaking this is not a termbase. It is some sort of dictionary that shows as much information as possible for every word in English. I think this could be useful for terminology research. It also has an API which is giving me a few interesting ideas. :)

WebTerm

This is "the collection of terminology databases based on diploma theses written at the Cologne University of Applied Sciences". It covers many diverse domains. The database seems to be divided into glossaries and it looks like you can't search across glossaries. Anyway, the glossaries are organised by domain so you might not need to search across them. Most glossaries are in German and English only, although I found glossaries that also supported French and Spanish. The database exposes term, definition and context. The interesting bit is that definition and context fields can also link to other terms, which is a handy and original feature in a termbase.

Terminology Management Software

A collection of different glossaries for many domains/industries. Available languages vary from glossary to glossary.

TAUS Search

This is not really a termbase, but rather a translation memory search portal. It searches translation memory software strings from many companies including Microsoft, EMC, Dell, eBay, Sun and others. I think the idea is that these translation memories are used as training data for Machine Translation and you can review them through TAUS Search and correct them if you spot an error.

TERMIUM Plus

The termbase of the Canadian government. It covers a wide range of domains and is available in English, French and Spanish. It exposes the terms in the three languages as well as definitions, context and observations in the same three languages.

National Terminology Database for Irish

A very comprehensive bilingual termbase for English and Irish developed and maintained by Fiontar in Dublin City University. Thanks to Michal Boleslav Měchura who suggested this termbase!

UNTERM: United Nations Multilingual Terminology Database

The name says it all! :)

Base de Terminologie du Conseil international de la langue française

Termbase from the International Council of the French Language. English, French, Spanish and German. Covers a wide range of domains.

WTOTERM: World Trade Organisation Terminology Database

WTO’s Termbase. English, French and Spanish.

FAO TERMINOLOGY and FAOTERM

Terminology website of the Food and Agriculture Organisation of the UN. The site links to documents, termbases, glossaries and thesauri. It also links to FAOTERM, which seems to be their main termbase. FAOTERM covers Arabic, Chinese, English, French, Italian and Spanish.

UBTerm: Base de dades terminològica de la Universitat de Barcelona

The University of Barcelona’s termbase. Domains covered include teaching, research, management, etc. The teaching and research domains are actually quite broad as they include all the subjects taught in the University. Languages: Catalan, Spanish, English, French, German and Italian.

Banque de données terminologique du Service de la langue française du Ministère de la Communauté française de Belgique

Termbase of the French Language Service of the Belgian Ministry for the French-speaking Community. Covers many domains. Languages: French, English, Dutch and German.

OncoTerm: Sistema Bilingüe de Información y Recursos Oncológicos

This termbase was the outcome of a research project on terminology associated with cancer (diseases, medications, treatments, and related domains). From the description, it seems that this resource hasn’t been updated since it was created (2002). Languages: English and Spanish.

EUSKALTERM: Banco Terminológico Público Vasco

Termbase from the Basque Government. Covers a wide range of domains. It actually covers all vocabularies and lexicons built by UZEI as well as other resources. Languages: Basque, French, English, Spanish and Latin.

Cercaterm

Termbase from TERMCAT, the Catalan terminology centre. If you register in Cercaterm, you can submit terminology queries to TERMCAT via the personalised service (servei d’atenció personalitzada). Cercaterm covers many domains and supports many, many languages. However, most entries seem to be available in Catalan, Spanish, French, Italian, English and German.

 

I would like to thank Licia Corbolante for suggesting some of the termbases that appear listed here.

Drop files to several boxes in one go!

Add Comment | Sep 20, 2009

I've been very quiet lately, mostly because I've been very busy with work. However, I have discovered a tool that has improved my productivity by helping me be more in control of my files: Dropbox.

Many of us have more than one computer (e.g. a desktop and a laptop) which makes it difficult to keep track of your files. In my case I have one desktop computer in College, one desktop computer at home and I have my laptop that I bring back and forth. However, it can be quite difficult to remember on what computer I created a file, whether I made copies on the other machines and whether the machine I'm currently using has the latest version of the file. To further illustrate this issue, here I depict some other frustrating scenarios:

  • You create a spreadsheet in your work desktop one day, and days or weeks later you're working from home on your laptop and you need that spreadsheet.
  • You know that you have a certain file you urgently need in one of your machines, but you don't know on which one, so you start searching on all machines.
  • You're doing a presentation, and halfway through the presentation you realise your laptop has an old version of the slide deck, and remember that the latest version is sitting on a file share you used to collaborate with a colleague.
  • Your laptop's hard disk (which contains the only copy of the code that took you weeks to write) has just crashed... –OR– You might have a backup of the code, but more than likely it's weeks old.

Well, the solution for all of these frustrations is a little program called Dropbox. This is a tool that you install on all of your computers and that keeps a "dropbox" folder in synch across all of them. In addition, Dropbox keeps a copy of this dropbox folder in the cloud (i.e. on the Internet) secured by a password.

To better illustrate Dropbox's usefulness, let's again use a scenario: let's say that you have a work desktop computer and a laptop. You install Dropbox in your work desktop computer and create an account with the Dropbox web service. You create a dropbox folder inside your My Documents folder. Then, you download and install Dropbox in your laptop and you log in to your laptop's Dropbox instance using the account you created before. In your laptop you also create a dropbox folder inside your My Documents folder.

Any file (document, spreadsheet, Access database, Visio flowchart, etc.) that you copy in either computer's dropbox folder will be automatically copied to the other computer's dropbox folder. If one of the computers is turned off, it will get synchronised the next time you turn it on and connect it to the Internet. Also, since Dropbox also keeps a copy of the files in the cloud (under your Dropbox account), you can still access these files from a third computer even if your two computers are turned off!

It's a simple idea, but I think it's great and makes it very easy to keep your files always up to date in all of your machines.

Also, if another colleague of yours also uses Dropbox, you can share a folder with your colleague, and again both you and your colleague will have a copy of all of the files you put in the shared folder in all of your computers!

Dropbox works on Windows (XP, Vista, 7) as well as Macintosh and Linux! So even if your machines are not compatible with each other, you can still keep them in synch!

Now, since Dropbox keeps a copy in the cloud there's a limit to the amount of data you can keep in your dropbox folder. You get 2 GB for free, which I think is enough to keep all your current work documents in synch. Once you've finished with a project, you can move it out of the dropbox folder into some sort of cold storage (an external hard disk, a DVD/CD-ROM or another external backup service like Crashplan). But if you do need more space, you can pay a fee to get additional space.

You can also get additional space for free if you do referrals (you get 250 MB for free for each referral up to a maximum of 1 GB). And actually, if you don't mind and if you're interested in signing up for Dropbox, could you sign up using the referral link below so that I get some additional space for free? ;-)

Sign up to Dropbox using my referral link: https://www.getdropbox.com/referrals/NTIwNDk5OTc5

As I mentioned before, you can install Dropbox in Windows, Mac and Linux. So far, I have successfully installed it in Windows Vista and Ubuntu 9.04. The Windows Vista installation was very easy, as it always is with Vista, isn't it? :-)

However, the Ubuntu installation wasn't that straightforward. So here I give you some tips to successfully install it in Ubuntu. All of this info was collected from external sources (blogs, newsgroups, etc.) and I provide the links to these websites at the end of this post. I'm no expert in Linux, so if this doesn't work for you, please try to see the external sources provided and try to use the info supplied there. I had to make changes to some commands from the external sources as they wouldn't work in my system if I applied them verbatim, so you might need to adapt the commands to your system too. Good luck!

Installing Dropbox in Ubuntu

  1. Download the Ubuntu package from the Dropbox website and run it as suggested by the website (i.e. double-click it). The website doesn't give you more details on what to do next, so the following steps tell you what you need to do after installing the package.
  2. Open the terminal and run this command:

dropbox start -i

 

What you did in step 1 is the installation of some sort of an installer. In step 2, you run the installer that you installed and this is what downloads and installs the actual Dropbox software (the Dropbox daemon).

Follow the steps in the wizard to configure your Dropbox installation (e.g. create a new account or log into an existing account).

At this point Dropbox will be running. If it isn't, you should be able to log off and then on and Dropbox should start running. In my system, this didn't happen and I had to go to Applications -> Internet -> Dropbox every time I logged on to Ubuntu.

I did some research online and it looks like Dropbox in Ubuntu starts automatically upon login for some systems but not for all of them. I don't know why.

However, I came across a blog article that describes how to run Dropbox automatically when your system boots. The advantage of this over starting automatically upon login is that you don't have to be logged in the system for synchronisation to take place (as long as the machine is on, it will be synchronised). So if you're interested, you can try it out:

Running Dropbox in Ubuntu automatically upon booting

  1. Make sure Dropbox is installed as per my instructions above.
  2. Create a service script: open a new, empty text file in a text editor.
  3. Paste this script into the file:

    #!/bin/sh
    # dropbox service

    # Start the Dropbox daemon in the background.
    start() {
        echo "Starting dropbox..."
        start-stop-daemon -b -o -c user -S -x /home/user/.dropbox-dist/dropboxd
    }

    # Stop the running Dropbox daemon.
    stop() {
        echo "Stopping dropbox..."
        start-stop-daemon -o -c user -n dropbox -K
    }

    # Dispatch on the action passed by the init system.
    case "$1" in
        start)
            start
            ;;
        stop)
            stop
            ;;
        reload|force-reload|restart)
            stop
            start
            ;;
        *)
            echo "Usage: /etc/init.d/dropbox {start|stop|reload|force-reload|restart}"
            exit 1
            ;;
    esac

    exit 0

     

  4. Change every instance of "user" in the script to your actual Linux username.
  5. Save it in your Documents directory as "dropbox". Open a terminal window and navigate to the directory where the script file was saved. Copy the file to /etc/init.d/dropbox using root permissions. To do this, use the sudo command:

    sudo cp dropbox /etc/init.d/dropbox

     

  6. Then type these commands in the terminal:

    cd /etc/init.d
    sudo chmod 755 dropbox
    sudo update-rc.d dropbox defaults
    /etc/init.d/dropbox start

     

  7. Reboot your machine and the service will start automatically.

    You can check if the service is running by typing this command:

ps -ef | grep dropbox

Sample output:

user 2340 1 0 16:21 ? 00:00:04 /home/user/.dropbox-dist/dropbox

 

You can type this command to check Dropbox's status:

dropbox status

Sample output:

Idle
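Step 4 of the boot instructions above asks you to change every "user" in the service script by hand. If you prefer to let the terminal do it, here is a minimal sketch. It is not part of Dropbox: the function name personalise_script is made up for this post, and it assumes the placeholder appears only as "-c user " and "/home/user/", as it does in the script above. The demonstration uses "alice" as a stand-in username.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Dropbox): reads the service script
# on standard input and writes a copy with the placeholder account replaced.
# Assumption: the account name contains no "|" or "&" characters, which
# would confuse sed.
personalise_script() {
    name="$1"
    sed -e "s|-c user |-c $name |g" \
        -e "s|/home/user/|/home/$name/|g"
}

# Small demonstration on the two lines of the script that mention the account.
printf '%s\n' \
    'start-stop-daemon -b -o -c user -S -x /home/user/.dropbox-dist/dropboxd' \
    'start-stop-daemon -o -c user -n dropbox -K' |
    personalise_script alice
```

With the script saved as "dropbox" (step 5), you could then run something like personalise_script "$USER" < dropbox > dropbox.new, check the result by eye and rename it before copying it to /etc/init.d. Doing two narrow replacements rather than a blanket search for "user" avoids touching unrelated words.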

 

Hopefully you'll find Dropbox useful. And remember, if you do decide to sign up and if you don't mind using my referral, please use this link: https://www.getdropbox.com/referrals/NTIwNDk5OTc5

____________________ __________ _____ ___ __ _ 

External Sources that I referred to in writing the Ubuntu instructions provided in this post:

We’re back online!

Add Comment | Aug 23, 2009

Due to a catastrophic network problem at my Web Hosting provider, my blog was down for over a week.

The service has just been restored and I'm back online. I'll continue posting more stuff shortly. Thanks for your patience.

Alfredo

A handy little terminology tool for Word

Add Comment | Aug 09, 2009

Jost Zetzsche from International Writers wrote in his newsletter about a little tool to search terminology from within Microsoft Word. To use it, simply highlight a term in Word and press CTRL+SHIFT+I. The tool will search the EuroTermBank database (as well as other termbases) and will present the results in a separate pane.

In this screenshot you can see my Microsoft Word interface highlighting the word "terminology" and searching it in the EuroTermBank database

 

Very handy! You can install this tool from here.

By the way, if you are a serious translator, localiser or terminologist, I strongly suggest you subscribe to Jost's newsletter.

Visit this page for details:

 

In the latest newsletter issue, Jost talks about multilingual search engines, a publicly available centralised Translation Memory database and a few other interesting tools and resources. If you want to know more, subscribe to the newsletter! :)

Half way through ESSLLI 2009

2 Comments | Jul 29, 2009

The first week is done and dusted and the weekend was great fun indeed!

The ESSLLI committee organised a party and a visit to a wine château and the village of Saint Émilion. The party organised by ESSLLI was great fun as it allowed us to chill in a more relaxed atmosphere and because we made it fun: we danced and chatted pretty much all night long. The visit to Saint Émilion could have been a bit longer (we only spent 1 hour there) but I might go there again in my own free time next week (I'm staying in the area for a few days after ESSLLI is finished).

This social agenda was great, but I think the best fun was in the picnics we had at the student accommodation (outside the residence buildings there are some green areas with benches). It has been very nice to relax as the evening tones down the hot weather and we socialise over fresh food and Bordeaux wine while we exchange student stories and somebody is playing the guitar (or another instrument) in the background. It was nice to make so many new friends from so many countries! :)

Course-wise, definitely the lecture I enjoyed the most during the first week was Foundations of Statistics. However, I think the Mental Lexicon has provided me with a very good theoretical background to work upon in my project. Computational Psycholinguistics was very interesting indeed, and described several machine-learning methods I might be able to use in my project. But the course content itself is not very relevant to my project.

The second week is definitely looking much more promising already. The concepts from Computational Lexical Semantics and Distributional Semantic Models are exactly what I need to get my project started. They're both excellent courses and the material is very well presented.

Linguistic Information Visualisation has actually turned out to be a surprisingly interesting and useful course. The course is about using graphic visualisations (graphics, charts, diagrams, histograms, etc.) in a useful but creative way. It teaches students how to present a lot of information visually in a way that is easy to communicate. One of the lecturers is a graphic design artist who re-trained as a computer scientist (it's never too late to change careers or enhance an existing one!), and she's definitely doing a lot to improve how data is visualised in Computational Linguistics. In the course we critique graphic materials (we discuss what works and what doesn't work visually), we see how to convey information by using colour, shape, texture and other visual variables effectively, and we try to conceive new graphic formats beyond the traditional histogram and pie chart. It's a bit like having an art class in a mainly scientifically-oriented ESSLLI. The course is not hard, but it forces you to think creatively, so it is challenging in itself. The course is just after lunch. Lectures in the after-lunch slot usually have to compete with digestive processes for students' attention. But the course is so entertaining that it doesn't feel like a drag.

The end of ESSLLI 2009 is quite close. But the experience has been great so far and it looks like the stuff learned, the networks built and the friendships made will stay with us for a very long time.

---

Update: Here's an exercise from the Information Visualisation class. It is an aesthetically pleasant tag cloud from my blog.

Wordle: My Blog's wordle

More about ESSLLI

Add Comment | Jul 23, 2009

I just stumbled upon a nice blog post (in French) about ESSLLI that describes very well the spirit of the Summer School. A good read if you want to understand why ESSLLI is interesting and why it attracts so many participants from every corner of the world:

http://www.paperblog.fr/2148967/pour-un-ete-studieux-esslli-2009/

A little teaser:

Une telle ferveur, cela suppose évidemment l'existence d'un champ d'investigation qui passionne, même si cela peut sembler mystérieux aux yeux du grand public. Qu'ont en effet à faire ensemble toutes ces disciplines, en quoi par exemple la linguistique, c'est-à-dire l'étude systématique des propriétés des langues humaines, a-t-elle à voir avec la façon dont fonctionnent nos ordinateurs ? Et la logique, en principe simple étude du raisonnement, que vient-elle faire dans ce panorama ?

ESSLLI2009: First impressions

Add Comment | Jul 22, 2009

ESSLLI2009 started on Monday 20th July in Bordeaux. The first three days of ESSLLI have passed very, very quickly. But there are still two-thirds of ESSLLI ahead of us!

I have been really enjoying the courses I'm taking this week:

The Mental Lexicon started very theoretical and the lecturer was presenting ideas that seemed a bit half-baked (mostly hypotheses and personal opinions). But today the course started to cover more tangible stuff relating to Lexicography. Today we covered WordNet (nothing new), but tomorrow we'll be covering FrameNet and similar stuff. I'm not planning to use these resources directly in my project, but the theory behind them and the Lexicography stuff can be useful as a good theoretical backdrop. I'm definitely looking forward to the remainder of this course!

Foundations of Statistics is just plain fun! The course covers basic probability and statistics theory (the kind of stuff you'd cover in your first year as an undergraduate). The innovative thing about the course, however, is that as the lecturer explains the theory, he demonstrates each theoretical concept by running simulations in the R language. When I took my Probability and Statistics course as an undergraduate, you could only take the teacher's word and believe all the theory. Later, when I took my Mathematical Statistics course, I had to prove that same theory analytically. Now, at the ESSLLI course, we get the computer to demonstrate the theory through simulations! We're seeing the inner workings of the Central Limit Theorem, the Binomial, Normal and T Distributions, Hypothesis Testing, etc. I'm also enjoying learning a new language: the R language. It reminds me a lot of Matlab, Python and Scheme (a very nice combination!) and it looks very easy to use. Oh, and I love the lecturer's style. Very engaging, very passionate about what he's talking about and he's able to manage time and questions very well! A++++
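To give a flavour of what such a simulation looks like, here is a minimal sketch. It is not the lecturer's code and it is not R: it swaps in awk, called from a shell function, so that it runs in any Linux terminal. The function name clt_demo, the uniform(0,1) population, the sample size of 30 and the seed are all my own choices for illustration. It draws many samples, averages each one, and reports the mean and standard deviation of those averages, plus the fraction of averages that fall within 1.96 theoretical standard errors of 0.5. If the Central Limit Theorem does its job, that fraction should be close to 0.95.

```shell
#!/bin/sh
# clt_demo SAMPLES SIZE SEED (hypothetical helper): draw SAMPLES samples of
# SIZE uniform(0,1) numbers each and print three values:
#   mean of the sample means, their standard deviation, and the fraction
#   of sample means within 1.96 theoretical standard errors of 0.5.
clt_demo() {
    awk -v samples="$1" -v size="$2" -v seed="$3" 'BEGIN {
        srand(seed)
        se = sqrt(1 / 12 / size)      # theoretical standard error of the mean
        sum = 0; sumsq = 0; inside = 0
        for (s = 1; s <= samples; s++) {
            total = 0
            for (i = 1; i <= size; i++) total += rand()
            m = total / size          # mean of this one sample
            sum += m
            sumsq += m * m
            if (m > 0.5 - 1.96 * se && m < 0.5 + 1.96 * se) inside++
        }
        mean = sum / samples
        sd = sqrt(sumsq / samples - mean * mean)
        printf "%.4f %.4f %.4f\n", mean, sd, inside / samples
    }'
}

# Theory predicts roughly 0.5, 0.0527 (that is sqrt(1/12)/sqrt(30)) and 0.95.
clt_demo 5000 30 42
```

The exact digits depend on your awk's random number generator, so they will differ from machine to machine, but they should land near the theoretical values. Changing the sample size from 30 to something small like 2 is a quick way to see the normal approximation get worse.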

Computational Psycholinguistics is very interesting. The course started by giving an introduction to some traditional Psycholinguistic theories but yesterday it started shifting towards the computational theories. And guess what? They're probabilistic! So, even if my project won't cover any psycholinguistic aspects, I'll be able to borrow a lot from the probabilistic methods explained. And who knows, I might find a use for the psycholinguistic aspects as well. They're so fascinating by themselves anyway! By the way, can you (a human being) correctly parse (and understand) this sentence:

The complex houses married and single students and their families.

Next week I will be taking courses much more relevant to my project:

Well, the last two courses might not be that relevant, but I have to do something during those time slots.

Before that, I'll have to explore Bordeaux a bit more this weekend. I've spent most of my time indoors attending lectures and the place where we're staying (the student accommodation) is not located within particularly picturesque surroundings... Did I mention already we're sharing our rooms with cockroaches? Being a student again is definitely not boring at all!!

International Spanish

3 Comments | Jul 19, 2009

As the initial reports of the A(H1N1) flu pandemic started to plague the media, the Mexican Government initiated emergency actions in an attempt to contain the spread of the virus. As part of one of these emergency actions, the Government directed the Army to distribute free surgical masks to the population, especially in the Mexico City area. The media covered this extensively.

During that time, I happened to read about this news item in three online newspapers from three different Spanish-speaking countries: Mexico, Spain and Argentina. It struck me that each of these papers called the "surgical mask" being distributed by the Army differently: the Mexican newspaper called it "cubrebocas", the Spanish newspaper called it "mascarilla" and the Argentinean newspaper called it "barbijo".

As a native Mexican, the "cubrebocas" form was to me the most familiar and straightforward. The variant from Spain sounded correct to me but a bit out of context, since in Mexico "mascarilla" would normally be restricted to describe either the mask worn in chemical laboratories when handling products that release potentially toxic vapours, or the mask made out of cosmetic creams that is applied on the face as a beauty treatment. But the word from Argentina, "barbijo", was completely new to me as I had never seen it before! Although, upon close inspection, it is obvious that "barbijo" is derived from "barba" (beard), suggesting that the garment is to be worn on the facial area around the mouth. So even if "barbijo" is not used in other Spanish-speaking locales, it follows typical Spanish word-creation rules by resorting to existing word roots and affixes*.

The Spanish language is shared by millions of speakers around the world. The geographical distribution of its speakers has created regional linguistic differences through time. It is common for speakers from one region to dismiss the usages of other regions as foreign or simply wrong. However, notice that the words "cubrebocas", "mascarilla" and "barbijo" appear in their respective countries' newspapers. (Respectable) newspapers normally use an educated and formal language register. That is, they do not use slang or uneducated variants of the language. However, this educated and formal register is still local. Therefore, "cubrebocas", "mascarilla" and "barbijo" are alternative local standard Spanish lexical representations of the same concept. So, labelling one or the other wrong is actually... well ... wrong!

We have to acknowledge, however, that differences like this have the potential to hinder international communication amongst language users. And in the current context of globalisation, and especially now that the new Web 2.0 technologies allow us to interact with people from every corner of the planet, Spanish speakers are forced to come into closer contact with alternative dialects of the same language.

Wouldn't it be great to have a way of enabling Spanish speakers around the world to understand each other's dialectal differences easily?

There are many dictionaries of regional usages and the Royal Spanish Academy (RAE) published just a few years ago a comprehensive Pan-Hispanic dictionary, documenting educated and formal usages from all Spanish-speaking countries. However, paper-based dictionaries have several drawbacks:

  • They tend to get outdated within a short period of time as language is always evolving
  • They are expensive to produce and sometimes to buy
  • They are selective, focusing mostly on general language, and might not document a usage that could suit a particular purpose or domain (e.g. a specific scientific or technical domain)
  • Even if they are Pan-Hispanic, it will be difficult to cover the particularities of every single Spanish dialect on the planet. Would a dictionary publishing house be able to hire a linguist from each of the 30+ countries where Spanish is spoken?
  • They are made of paper and therefore heavy! Language users will not have access to them all the time!

Online dictionaries, on the other hand, have the advantage that they can be consulted anytime as long as there is an Internet connection. With the spread of mobile Internet devices and wifi networks, online dictionary availability is on the increase. In fact, the RAE's Pan-Hispanic Dictionary can be queried online (search under "Diccionario panhispánico de dudas").

In spite of these advantages, because online dictionaries are still managed by hand, they will still tend to become obsolete with time and somebody has to pay for their maintenance. A well co-ordinated crowdsourcing model could potentially alleviate these maladies. And that might be a good topic for another post.

In any case, I think dictionary publishers and language users in general could benefit from automated language analysis tools capable of detecting geographic synonyms like "cubrebocas", "mascarilla" and "barbijo". The development of algorithms designed to detect this type of synonyms will be one of my main research objectives.

The last ten years or so have seen rapid development in Computational Lexical Semantics, that is, the study of the meaning of words by means of computational methods.

In 1998 Dekang Lin published a novel method to automatically build thesauri from text by analysing words in their context. To illustrate the power of this approach, Lin gives the following example. Consider these sentences:

A bottle of tezgüino is on the table.
Everybody likes tezgüino.
Tezgüino makes you drunk.
We make tezgüino out of corn.

The contexts in which the word "tezgüino" is used suggest that tezgüino is some sort of alcoholic drink made from maize. In his paper, Lin shows how the information provided by these contexts can be used to build tools capable of automatically inferring that tezgüino is similar to beer, wine, vodka, etc.
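To make the idea a bit more concrete, here is a toy sketch using standard Unix text tools. It is emphatically not Lin's method: his paper works with parsed text and an information-theoretic similarity measure, whereas this sketch simply counts how many context words two words share. The four "beer" sentences are invented for the illustration, and the function names contexts and shared_contexts are my own.

```shell
#!/bin/sh
# Byte-wise, reproducible behaviour for tr, sort and uniq.
LC_ALL=C
export LC_ALL

# Toy corpus: the four tezgüino sentences plus four invented "beer"
# sentences (the latter are an assumption added for illustration).
corpus='A bottle of tezgüino is on the table.
Everybody likes tezgüino.
Tezgüino makes you drunk.
We make tezgüino out of corn.
A bottle of beer is on the table.
Everybody likes beer.
Beer makes you drunk.
We make beer out of barley.'

# contexts WORD: print the sorted set of words that share a sentence
# with WORD (one sentence per line, WORD itself excluded).
contexts() {
    printf '%s\n' "$corpus" |
        tr 'A-Z' 'a-z' |
        awk -v target="$1" '{
            gsub(/\./, "")                    # drop sentence-final periods
            hit = 0
            for (i = 1; i <= NF; i++) if ($i == target) hit = 1
            if (hit) for (i = 1; i <= NF; i++) if ($i != target) print $i
        }' |
        sort -u
}

# shared_contexts A B: count the context words A and B have in common.
# Each set is already duplicate-free, so any line that appears twice in
# the merged list belongs to both sets.
shared_contexts() {
    { contexts "$1"; contexts "$2"; } | sort | uniq -d | wc -l | tr -d ' '
}

echo "tezgüino / beer: $(shared_contexts tezgüino beer)"
echo "tezgüino / corn: $(shared_contexts tezgüino corn)"
```

On this tiny corpus "tezgüino" shares 15 context words with "beer" but only 4 with "corn", which is the intuition behind the approach: words that keep the same company tend to mean similar things. A real system would need far more text and a weighting scheme so that common words like "the" and "of" do not dominate the count.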

Many improvements to these techniques have been made since 1998 and there are still many ways in which they can be further improved. But I personally think it is amazing that the application of this body of research could potentially enable us to build a Live Online Pan-Hispanic Dictionary. That is, an online dictionary that can update itself by automatically analysing how Spanish is used on the web.

The web is full of language use evidence. Most major newspapers are now published online, a lot of people have blogs, users interact with each other in social networks and discussion groups, etc. If we are able to detect the geographical origin of an online text, we can then mine for word meanings on each text and catalogue these word meanings by geographical location. We have the potential of building a global atlas of Spanish lexical variation. And as I mentioned before, we can keep this up to date by fetching new content every day and running our computational lexical semantic algorithms.

Oh, and by no means do we have to stop with Spanish. We could build global atlases of lexical variation for other international languages like English, French, Portuguese and Arabic.

But my dream (at least for now) is that soon Spanish speakers from every corner of the planet will be able to understand each other more easily and learn from each other's dialects with the help of intelligent tools and devices that are capable of understanding human language meaning.

In my next post about Spanish, I will talk about the so-called "Neutral Spanish" lect. I will explain what it is, its advantages and disadvantages over regional varieties of Spanish, why multinational companies insist on it when commissioning translations into Spanish, a few tips and tricks to make sure you write in Neutral Spanish, and how technology will hopefully in the future spare us from this artefact.

- - - -

* Note about "barbijo": the suffix "-ijo" when applied to verbs means "the result of an action". When applied to nouns, this suffix normally produces a diminutive or pejorative. In this case, it is being applied to a noun (barba). So it might be used to make reference to the small size of the mask or possibly to indicate that it only covers part of the face (the mouth/beard region). If we consider allowing the meaning "the result of an action" to be applied to this noun and if we are open to a farfetched interpretation, the suffix might indicate that a person's mouth/beard gets covered as a result of wearing a surgical mask. But all of this is my own speculation. If you are familiar with this word and know more about its etymology, please leave a comment and enlighten us. :)