The EF-Cambridge Open Language Database (EFCAMDAT) is the largest open-access corpus of English learner essays. It comprises submissions from students worldwide who attend an online EF school. Learners are assigned to proficiency levels based on their initial placement test results or through successful course progression. The 16 proficiency levels, aligned with the Common European Framework of Reference for Languages (CEFR), each consist of eight lessons designed to enhance reading, listening, speaking, and writing skills. EFCAMDAT includes scripts from writing tasks at the end of each lesson, covering topics like "writing a resume" and "giving budgeting advice."
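For users who want to group scripts by CEFR band, the level structure described above can be sketched as a small lookup. This is a minimal illustration, not part of the corpus distribution: the band boundaries (levels 1 to 3 as A1, 4 to 6 as A2, 7 to 9 as B1, 10 to 12 as B2, 13 to 15 as C1, 16 as C2) are an assumption based on the alignment reported in the corpus documentation and should be checked against the user guide, and the names `CEFR_BANDS` and `ef_level_to_cefr` are hypothetical.

```python
# Assumed alignment of the 16 EF levels to CEFR bands; verify against
# the EFCAMDAT user documentation before relying on it.
CEFR_BANDS = [
    (range(1, 4), "A1"),
    (range(4, 7), "A2"),
    (range(7, 10), "B1"),
    (range(10, 13), "B2"),
    (range(13, 16), "C1"),
    (range(16, 17), "C2"),
]

LEVELS = 16            # proficiency levels, as stated above
LESSONS_PER_LEVEL = 8  # lessons per level, each ending in a writing task


def ef_level_to_cefr(level: int) -> str:
    """Return the (assumed) CEFR band for an EF level between 1 and 16."""
    for levels, band in CEFR_BANDS:
        if level in levels:
            return band
    raise ValueError(f"EF level must be between 1 and {LEVELS}, got {level}")


# One writing task per lesson gives the number of distinct task slots.
total_writing_tasks = LEVELS * LESSONS_PER_LEVEL
```

With these figures, `total_writing_tasks` comes to 128, one writing task for each lesson across all levels.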
In its first release, the corpus contained 551,036 scripts from 84,864 learners. The second release expanded to 1,180,310 texts from 174,743 learners. A cleaned subcorpus was also created, containing only texts from levels 1 to 15 by learners from the 11 most represented nationalities.
Academic researchers can request access to the second release of the corpus (in XML format), the cleaned subcorpus (in XLSX format), and the list of task prompts through our Google Drive.
User agreement
Use the link below to download the user agreement as a PDF file.
User Agreement (PDF file)
Request access
Follow the link below to submit an application to access the corpus. Please note that an academic affiliation and access to Google Drive are required. To open the corpus request form, you must sign in with a Google account registered to your university email address.
Corpus Access Request Form
If you need to set up a Google account with your academic email address, you may refer to the instructions HERE (see the section titled "Can I use an existing email address?").
Download corpus
Follow the link below to download the EFCAMDAT Corpus files. Note that your application (above) must be approved by the administrators before you can access the Google Drive. In the unlikely event that you think your request has been missed, please resubmit the Corpus Access Request Form.
Corpus Files (Google Drive)
No longer using the corpus?
Follow the link below to submit a request for the administrator to remove your data.
Corpus Withdrawal Request Form
Get in touch
If you have any difficulty accessing the corpus or have any questions, please email the EFCAMDAT corpus administrator Rory Leung.
Please cite the following when using the EFCAMDAT data:
Geertzen, J., Alexopoulou, T., & Korhonen, A. (2014). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCamDat). In R.T. Millar, K.I. Martin, C.M. Eddington, A. Henery, N.M. Miguel, & A. Tseng (Eds.), Selected proceedings of the 2012 Second Language Research Forum (pp. 240–254). Somerville, MA: Cascadilla Proceedings Project.
Huang, Y., Geertzen, J., Baker, R., Korhonen, A., & Alexopoulou, T. (2017). The EF Cambridge Open Language Database (EFCAMDAT): Information for users (pp. 1–18). Retrieved from https://ef-lab.mmll.cam.ac.uk/EFCAMDAT.html
Please cite the following if you are using the cleaned subcorpus:
Shatz, I. (2020). Refining and modifying the EFCAMDAT: Lessons from creating a new corpus from an existing large-scale English learner language database. International Journal of Learner Corpus Research, 6(2), 220–236. doi:10.1075/ijlcr.20009.sha
Please cite the following if you are using the cleaned, part-of-speech-tagged, and error-coded subcorpus:
Öksüz, D., Derkach, K., Alexopoulou, T., & Tsimpli, I. M. (under review). The influence of L1 typology on the acquisition of the L2 English article: A large-scale corpus study. Second Language Research.