From text to phonological pronunciation

Orthographic text in many languages does not encode the precise pronunciation of the corresponding spoken utterance. It is therefore useful to be able to transform a text automatically into a phonological encoding (e.g. for speech synthesis). The CLARIN web service G2P provides such a tool for a multitude of languages. This use case solves the task: given a Hungarian text, what is the most likely phonological pronunciation?

Especially relevant for

  • Linguists
  • Phoneticians
  • Phonologists
  • Speech technologists

Starting point:

an orthographic transcript (*.txt)

Task:

a canonical transcription of the input, encoded in SAM-PA and stored in TCF, suitable for WebLicht processing

Solution:

use the BAS G2P web interface, or call the BAS G2P web service from the command line

Related CLARIN-D tools and services

Short guide on how to use BAS G2P

Preparation:

  1. download the ZIP package ftp://clarin.phonetik.uni-muenchen.de/BASWebServices/useCases/hungarianG2P.zip
  2. unpack it into your local desktop folder; there should now be a directory called 'hungarianG2P' on your desktop.

By Web Interface:

  1. start Chrome or Firefox and go to http://clarin.phonetik.uni-muenchen.de/BASWebServices
  2. select service 'G2P'
  3. drag&drop the file 01_1.txt from the directory 'hungarianG2P' onto the designated drop area
  4. press the button 'Upload'; you can now inspect the uploaded text file by clicking on the file link in the drop area
  5. execute the grapheme-to-phoneme conversion with the following options:
    • Language = Hungarian
    • Output format = tcf
    then confirm the terms of usage and press the button 'Run Web Service'
  6. after a few seconds a link to the result file '01_1.g2p.tcf' is shown below; depending on your browser you can click on it to inspect the result (or download it with a right-click, selecting "Save link as"). In the result XML file the text is split into word tokens, and the individual phonetic symbols of each word are separated by a blank.
  7. More about the SAM-PA phonetic encoding can be found in: http://www.phon.ucl.ac.uk/home/sampa/

By Web Service (for example on a Linux/Unix system):

  1. start a terminal and go to the directory hungarianG2P on your desktop, e.g.
    cd /home/user/Desktop/hungarianG2P
  2. execute the following curl call:
      curl -v -X POST -H 'content-type: multipart/form-data' -F i=@01_1.txt -F iform=txt -F oform=tcf \
      -F lng=hun-HU 'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runG2P'
  3. The response should be something like:
      <WebServiceResponseLink><success>true</success><downloadLink>https://clarin.phonetik.uni-muenchen.de:443/BASWebServices/data/2015.12.08_11.09.12_97D67C6035DBF0E705891B0E44756CBE/01_1.g2p.tcf</downloadLink><output></output><warnings></warnings></WebServiceResponseLink>
  4. copy and paste the URL from the <downloadLink> tag into your web browser to see the result, or
  5. download the result file based on the link in "downloadLink" with wget:
     wget https://clarin.phonetik.uni-muenchen.de/BASWebServices/data/[...] -O 01_1.tcf
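
The command-line steps above can also be combined into one small script. The following is only a sketch under some assumptions: the function names extract_tag and run_g2p are made up for illustration and are not part of BAS G2P, the response is assumed to arrive as a single line of XML as in step 3, and the result is fetched with curl instead of wget so that only one tool is needed.

```shell
#!/bin/sh
# Sketch only: extract_tag and run_g2p are illustrative names, not part of BAS G2P.

# Print the text content of the XML element named $1,
# reading a one-line service response from standard input.
extract_tag() {
    sed -n "s:.*<$1>\(.*\)</$1>.*:\1:p"
}

# Send the text file $1 to runG2P and save the TCF result as $2.
run_g2p() {
    response=$(curl -s -X POST -H 'content-type: multipart/form-data' \
        -F i=@"$1" -F iform=txt -F oform=tcf -F lng=hun-HU \
        'https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runG2P') || return 1
    # Stop if the service did not report success.
    if [ "$(printf '%s\n' "$response" | extract_tag success)" != "true" ]; then
        printf 'G2P failed: %s\n' "$response" >&2
        return 1
    fi
    link=$(printf '%s\n' "$response" | extract_tag downloadLink)
    # Fetch the result file (wget "$link" -O "$2" would work as well).
    curl -s -o "$2" "$link"
}
```

Once these definitions are loaded into the shell, a call such as run_g2p 01_1.txt 01_1.g2p.tcf from inside the hungarianG2P directory should upload the text and store the result next to it. Checking the <success> element first means a failed request is reported on the terminal rather than passing an empty link on to the download step.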