A6 - Computational lexicography - EMLex 2021 Summer Term

Teachers:

EMLex offers a diverse range of teachers and lecturers from around the globe. This course is taught by:

Prof. Dr. Stefan Evert

Friedrich Alexander University Erlangen-Nuremberg

Dr. Besim Kabashi

Friedrich Alexander University Erlangen-Nuremberg

Foundations of corpus linguistics
- principles and methods of corpus analysis
- applications of corpus data in lexicography
- types of corpora, overview of existing corpora
- corpus design, representativity, data sources, metadata
Corpus compilation
- building corpora from online data: web scraping etc.
- boilerplate removal, normalization, metadata extraction
- representation and exchange formats
- online and stand-alone tools for web corpus compilation
- automatic linguistic annotation (POS, lemma, NER, parsing, …)
- online and stand-alone tools for linguistic annotation
Searching corpora
- regular expressions
- character encodings and the Unicode standard
- CQP query language for lexico-grammatical patterns
- practical exercises with Sketch Engine and CQPweb
Quantitative analysis
- frequency lists and metadata distribution
- collocations and word sketches
- keyword analysis
- lexicographic interpretation of results
- foundations of statistical inference
Reproducibility
- research methodology and documentation
- data management, sustainability of corpus resources

Please see the module description for further information.

General information:

Time frame	22-26 March 2021
Room	on Zoom
Evaluation method	participation in a team project with a written report (the teams will be determined at the beginning of the module)
Teaching language	German and English

Information on the EMLex 2021 Summer school:

Practical arrangements: Well in advance of the course, participants will receive a syllabus, relevant literature, and suggestions on how to prepare via the Moodle platform. The sessions are moderated by the lecturer and the guest lecturer. The lessons center on practical computer exercises carried out in small groups (instructions will be given beforehand).

Certificate: There are two ways to obtain an EMLex 2021 Summer School certificate: (a) without grade, through active participation in practical exercises, class discussions, and a team project; or (b) with grade, through participation in a team project with a written report. The teams will be determined at the beginning of the course.

Schedule:

Time/day	Monday	Tuesday	Wednesday	Thursday	Friday
9:00-10:30	Welcome & Introduction (all)	Presentation of project ideas + corpus design with discussion (all)	Linguistic annotation & pre-processing (Heid)	Corpus search with CQP queries (Heid/Kabashi)	Final team presentations and discussion (all)
11:00-12:30	Lexicography and text corpora, Corpus design (Heid)	Collecting corpus data from the Web (Evert/Kabashi)	Representation formats, practice with Sketch Engine (Evert/Kabashi)	Frequencies, collocations, and keywords (Evert)	Final team presentations and discussion (all)
Lunch break
14:00-15:30	Form teams and discuss projects (Kabashi/Evert)	Collecting corpus data from the Web	Group work on team projects (Heid)	Group work on team projects (Kabashi)
16:00-17:30	Regular expressions (Evert/Kabashi)	Schierholz (A5), 16:15	Q&A session with instructors (Kabashi/Evert)	Q&A session with instructors (all)
18:00-19:30		Further group work as needed	Further group work as needed	Further group work as needed
