Standards & Specifications
The EAGLES Guidelines
The EAGLES Guidelines provide guidance on markup for text corpora, particularly for identifying features relevant to computational linguistics and lexicography.
TEI P5: Guidelines for Electronic Text Encoding and Interchange
The TEI has served for many years as a mature annotation format for corpora of different types, including linguistically annotated data.
Leipzig Glossing Rules
The rules cover a large part of linguists' needs in glossing texts, but most authors will feel the need to add (or modify) certain conventions (especially category labels). For the Georgian version, see ლაიფციგის გლოსირების წესები (Leipzig Glossing Rules).
Eurotyp Guidelines
The Eurotyp Guidelines were developed by the Committee on Computation and Standardization for the Programme in Language Typology.
MULTEXT-East Morphosyntactic Specifications
A multilingual dataset for language engineering research and development, focused on the morphosyntactic level of linguistic description.
The Digital Humanities Manifesto
A vision of the future of the humanities.
Corpus of Mingrelian Language
Under Construction.
Language Processing Tools
IPA Help 2.1
A useful, simple tool for learning to recognize, transcribe, and produce the sounds of the International Phonetic Alphabet (IPA).
Speech Analyzer
Speech Analyzer facilitates acoustic analysis of speech sounds.
Praat: doing phonetics by computer
Praat is a free computer software package for the scientific analysis of speech in phonetics.
Field Linguist's Toolbox
Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text.
ELAN
ELAN is a professional tool for the creation of complex annotations on video and audio resources.
Gephi
Gephi is the leading visualization and exploration software for all kinds of graphs and networks, and it can easily be adapted to visualize linguistic data.
PC-KIMMO
The program is designed to generate (produce) and/or recognize (parse) words using a two-level model of word structure in which a word is represented as a correspondence between its lexical level form and its surface level form.
FSM
A practical guide to finite-state theory and to the use of the Xerox finite-state programming languages LexC and xfst.
AntConc etc.
A freeware corpus analysis toolkit for concordancing and text analysis, along with a number of other tools.
General Linguistics Websites
The Linguist List
The Linguist List is dedicated to providing information on language and language analysis.
SIL Organization
SIL serves language communities worldwide, building their capacity for sustainable language development, by means of research, translation, training and materials development.
The World Atlas of Language Structures Online
The World Atlas of Language Structures (WALS) is a large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars).
The Language Archive
The Data Archive at the Max Planck Institute for Psycholinguistics stores a large body of unique material from a wide variety of languages worldwide, recorded and analyzed by researchers from different linguistic disciplines.
Endangered Languages
DOBES
The DOBES Archive contains language documentation data from a great variety of languages from around the world that are in danger of becoming extinct.
Endangered Languages Database
The Endangered Languages Database project includes a database of language endangerment levels with references to collections and recordings of oral literature that exist in archives around the world.
UNESCO Atlas of the World's Languages in Danger
The online edition of the Atlas of the World's Languages in Danger is a tool to monitor the status of endangered languages and the trends in linguistic diversity at the global level.
Georgian Corpora
The GNC Project
The Georgian National Corpus is a comprehensive corpus of the Georgian language covering all stages of its historical development.
The Georgian Language Corpus
The Georgian Language Corpus (GLC) comprises texts written in Old, Middle, and Modern Georgian and is equipped with additional features for their analysis.
Linguistic Portrait of Georgia
The Linguistic Portrait of Georgia is a database that aims to present the results of various projects on Georgian dialects.
Wardrops' Collection Online
The Wardrops' Collection Online (WCO) is a digital repository and research project devoted to the Wardrops' Collection of Georgian manuscripts preserved at the Bodleian Library.
Corpora Worldwide
Open Language Archives Community (OLAC)
OLAC is part of a larger community known as the Open Archives Initiative. The OAI develops and promotes interoperability standards for digital archives, and currently spans dozens of archives and a total of over a million records.
The Child Language Data Exchange System (CHILDES)
CHILDES is the child language component of TalkBank.