Tuesday, May 12, 2009

Foreign Language Help

I'm currently doing research that involves online communities and multiple languages. As part of this, I'm analyzing some exceedingly popular languages (Spanish, German, Japanese...) as well as some less-studied communities (Volapuk, Ukrainian, Esperanto).

The idea is that we're doing some basic text processing. To reduce the amount of time this takes and improve the value of the analysis, we want to exclude a standard list of stop words. In English, these are words such as "in," "to," "a," "and," "the," and "that." (Examples in English, German, French) I can find these lists for most European languages, and I have learned that some other languages (Japanese, Chinese) don't really have a concept of stop words.
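For anyone unfamiliar with the idea, here is a minimal sketch of stop word filtering in Python. The tiny word list, the function names, and the sample sentence are all made up for illustration; this is not the code or the lists we're actually using.

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"in", "to", "a", "and", "the", "that"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in stop_words]

sample = "The idea is that we go to a community and count the words in it."
filtered = remove_stop_words(tokenize(sample))
print(filtered)
```

Swapping in a different set (say, an Esperanto one) only means passing it as the second argument to remove_stop_words.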

While I've found stop word lists for most of the languages, I'm stumped on four: Esperanto, Volapuk, Ukrainian, and Bengali. Any insights would be appreciated.

4 comments:

Reid said...

I think Irina Shklovski is from Kazakhstan and probably either knows Ukrainian or someone who does. Do you know her? I can introduce you if not. Ivan Beskatchnikh whom you probably just met also has Eastern European connections.

Katie said...

I don't think I know Irina. Ivan was not actually at the conference; Dave McDonald was here in his place. I had some initial interest this morning from the Esperanto community on Twitter, but no word since then... I think if we can't find anyone easily, then we can also process and include stopwords, since the corpii? corpuses? are so much smaller.

Tim said...

[got here via Twitter]

I speak Esperanto competently and I've got a background in computational linguistics, so hopefully I can help out.

That said, I'd never heard of "stop words" before, and, having followed the links above for other languages, it appears that the definition of "stop words" is kind of flexible, to say the least. These two lists of English stop words differ considerably; the shorter list isn't even a subset of the longer list.

I could certainly provide Esperanto equivalents for any list of English (or French) "stop words", but if you want a pre-compiled list that someone has created according to a given definition of "stop words" using some kind of statistical corpus analysis, then I'm afraid I don't have that to hand. I know a few people I could ask, but I'd like to know exactly what to ask for before approaching them.

Katie said...

Thanks Tim. I know that lists of stop words are somewhat arbitrary and differ between lists. Our main goal is to reduce processing time and we aren't as interested in words/meaning/etc as in raw statistics. I think based on your comments and the other comments I got on Twitter that I'm going to see how fast I can process Esperanto without removing stop words. If I decide that I do want a list of Esperanto equivalents, I will let you know. Thanks for your quick response and helpful ideas!