CRBLP

Contact Information

Center for Research on Bangla Language Processing
BRAC University
66, Mohakhali, Dhaka-1212
Phone: +88 (02) 8824051-4 Ext:4023
Fax: +88 (02) 8810383
crblp-staff@student.bu.ac.bd

::--Corpus Analysis & Corpus Collection--::

Name:: Shadin Bangla Corpus

Summary::

The aim of this research is to analyze a text corpus for regularities and anomalies in Bangla script. Corpus analysis requires a large, balanced text corpus, so the research team is developing a way to collect Bangla text.

Details::

This project targeted a newspaper and some old texts: the Prothom-Alo newspaper, Charjapad, and Baru Chandi Das Er Kabbo. The CRBLP team selected one year of text from the most popular newspaper, “Prothom-Alo”. This newspaper corpus covers 32 categories of news, such as daily news, literature, economics, international, science and so on. Analyzing this corpus involved a significant amount of work: collecting the text from the web, converting it to Unicode, and then analyzing it. The analysis criteria include a word frequency list, bi-gram and tri-gram analysis, letter frequency and so on. Corpus collection is an important step toward a balanced text corpus, which will help in studying different linguistic phenomena.
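
The analysis criteria listed above (word frequency, bi-grams, tri-grams, letter frequency) can be illustrated with a minimal Python sketch. This is not the CRBLP tool. The function names, the sample sentence, and the tokenization rule are all assumptions made for illustration: the text is assumed to be already in Unicode, and a token is taken to be any run of code points from the Bengali block (U+0980 to U+09FF). "Letter" here means a single code point, so vowel signs are counted separately from the consonants they attach to.

```python
import re
from collections import Counter

# Assumption: a token is a run of Bengali-block code points (U+0980-U+09FF).
# The danda "।" (U+0964) lies outside this block, so it acts as a separator.
TOKEN_RE = re.compile(r"[\u0980-\u09FF]+")

# Hypothetical sample text, used only to demonstrate the functions.
SAMPLE = "আমি ভাত খাই। আমি গান গাই।"


def tokenize(text):
    """Return the Bangla tokens found in text, in order."""
    return TOKEN_RE.findall(text)


def ngrams(tokens, n):
    """Yield consecutive n-token tuples (n=2 for bi-grams, n=3 for tri-grams)."""
    return zip(*(tokens[i:] for i in range(n)))


def analyze(text):
    """Compute word, bi-gram, tri-gram, and letter (code point) frequencies."""
    tokens = tokenize(text)
    return {
        "words": Counter(tokens),
        "bigrams": Counter(ngrams(tokens, 2)),
        "trigrams": Counter(ngrams(tokens, 3)),
        # Counts individual code points, not grapheme clusters.
        "letters": Counter(ch for tok in tokens for ch in tok),
    }


if __name__ == "__main__":
    stats = analyze(SAMPLE)
    for name, counter in stats.items():
        print(name, counter.most_common(3))
```

Counting by code point keeps the sketch short. An analysis of conjuncts or full written syllables would need grapheme-cluster segmentation instead.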

Team::

Past team:

  • Yeasir Arafat
  • Md. Zahurul Islam

Status:: Tool is available for download. [Download]

Timeline:: 2006-2008