UU Database User's Manual

1 Overview

File Listings

		  /00README.txt           README
		  /Sessions               Speech data and XML documents for each session
		  /Sessions/C001          Data for Session C001
		  /Sessions/C001/C001.sd  Speech data for Session C001 (2 channels, ESPS format)
		  /Sessions/C001/C001.wav Speech data for Session C001 (2 channels, Microsoft WAVE format)
		  /Sessions/C001/C001.xml XML document for Session C001
		  /Sessions/C002          Data for Session C002
		  /Sessions/C003          Data for Session C003
		  /Sessions/C004          Data for Session C004
		  /Sessions/C005          Data for Session C005
		  /Sessions/C006          Data for Session C006
		  /Sessions/C007          Data for Session C007
		  /Sessions/C011          Data for Session C011
		  /Sessions/C012          Data for Session C012
		  /Sessions/C013          Data for Session C013
		  /Sessions/C021          Data for Session C021
		  /Sessions/C022          Data for Session C022
		  /Sessions/C023          Data for Session C023
		  /Sessions/C024          Data for Session C024
		  /Sessions/C031          Data for Session C031
		  /Sessions/C032          Data for Session C032
		  /Sessions/C033          Data for Session C033
		  /Sessions/C041          Data for Session C041
		  /Sessions/C042          Data for Session C042
		  /Sessions/C043          Data for Session C043
		  /Sessions/C051          Data for Session C051
		  /Sessions/C052          Data for Session C052
		  /Sessions/C053          Data for Session C053
		  /Sessions/C061          Data for Session C061
		  /Sessions/C062          Data for Session C062
		  /Sessions/C063          Data for Session C063
		  /Sessions/C064          Data for Session C064
		  /Sessions/speakers.xml  Speaker information file
		  /doc                    Documents
		  /doc/manual.pdf         UUDB User's Manual (Japanese)
		  /doc/resp.mat           Filter coefficients for downsampling
		  /tools                  XML-related utilities
		  /var                    Derivative data (Recoverable from speech data and XML documents)
		  /var/C001               Derivative data for Session C001
		  /var/C001/C001.list     Per-utterance speech file name list for Session C001 (in order)
		  /var/C001/C001.txt      Orthographic transcription file
		  /var/C001/C001.syl      Syllabic transcription file
		  /var/C001/C001.para     Averaged rating file of paralinguistic information
		  /var/C001/C001L_001.wav Speech file of 1st utterance by speaker on channel L
		  /var/C001/C001L_002.wav Speech file of 2nd utterance by speaker on channel L
		  /var/C001/C001R_001.wav Speech file of 1st utterance by speaker on channel R
		  /var/C001/C001R_002.wav Speech file of 2nd utterance by speaker on channel R

All the information that the UU Database provides is contained in the speech data and the XML documents under the directory /Sessions.
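The per-utterance speech files in the listing above follow a visible naming pattern (Session ID, channel letter, underscore, three-digit utterance number). The following is a minimal sketch of that pattern, inferred only from the example names in the listing; the function name utterance_wav_path is hypothetical and not part of the UUDB tools.

```python
from pathlib import PurePosixPath


def utterance_wav_path(session_id: str, channel: str, utterance_number: int) -> PurePosixPath:
    """Build the /var path of one per-utterance speech file.

    Assumption: names follow the pattern seen in the file listing,
    e.g. /var/C001/C001L_001.wav.
    """
    if channel not in ("L", "R"):
        raise ValueError(f"channel must be 'L' or 'R', got {channel!r}")
    # Utterance numbers are zero-padded to three digits in the listing.
    name = f"{session_id}{channel}_{utterance_number:03d}.wav"
    return PurePosixPath("/var") / session_id / name


print(utterance_wav_path("C001", "R", 2))
```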

Session ID

In the UU Database, a series of utterances in a discourse is treated as a unit called a session. Each session corresponds to the series of utterances produced for a given four-frame cartoon.

Each session is identified by a 4-character ID that starts with "C".

The character following "C" is always "0" in this release.

The next character is a serial number for the participant pair. For example, "3" is given to the pair of speakers FKC and FUE.

The following character is a session number for the pair.

Thus, for example, the Session ID C031 indicates the first session by the pair of speakers FKC and FUE.
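The Session ID layout described above can be decoded mechanically. The following is a minimal sketch, not part of the UUDB tools; the names SessionId and parse_session_id are hypothetical.

```python
import re
from typing import NamedTuple


class SessionId(NamedTuple):
    release_digit: str   # always "0" in this release
    pair_number: int     # serial number of the participant pair
    session_number: int  # session number within that pair


def parse_session_id(session_id: str) -> SessionId:
    """Split a 4-character Session ID such as "C031" into its parts."""
    match = re.fullmatch(r"C([0-9])([0-9])([0-9])", session_id)
    if match is None:
        raise ValueError(f"not a UUDB Session ID: {session_id!r}")
    return SessionId(match.group(1), int(match.group(2)), int(match.group(3)))


# "C031": pair 3, session 1.
print(parse_session_id("C031"))
```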

Speech Data

Speech data for each session are stored in the directory named after its Session ID under the directory /Sessions, in both ESPS format and Microsoft WAVE format. For example, the speech data for the session C031 are /Sessions/C031/C031.sd (ESPS format) and /Sessions/C031/C031.wav (WAVE format). Both files contain two-channel speech covering the whole span of the session, and they are identical except that the WAVE file lacks the Session Start Time information. Both formats are supplied because converting between them requires ESPS, which is no longer available.

XML Document

The XML document for each session is stored as a UUDB XML Document. For example, the XML document for the session C031 is /Sessions/C031/C031.xml.
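The locations of the three per-session files follow directly from the Session ID. A minimal sketch, assuming the database root is mounted at "/" as in the file listing; the function name session_files and the dictionary keys are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict


def session_files(session_id: str, root: str = "/") -> Dict[str, PurePosixPath]:
    """Return the speech-data and XML paths for one session.

    Assumption: `root` is wherever the UUDB distribution is mounted.
    """
    session_dir = PurePosixPath(root) / "Sessions" / session_id
    return {
        "esps": session_dir / f"{session_id}.sd",   # ESPS format
        "wave": session_dir / f"{session_id}.wav",  # Microsoft WAVE format
        "xml": session_dir / f"{session_id}.xml",   # UUDB XML Document
    }


for kind, path in session_files("C031").items():
    print(kind, path)
```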

Derivative Data

For users who are busy or not very familiar with XML, various types of ready-to-use data generated from the speech data and the XML documents are provided as derivative data under the directory /var. The derivative data can be regenerated with the accompanying utilities described in the Using XML Utilities section.

For a quick look at the database, the transcription files are a good starting point. Two kinds of transcription files are provided; both are in the Shift JIS encoding. Although this encoding is very popular among Japanese users, other users might find the files difficult to handle. Such users currently have the following options (both require an XSLT processor):

  • Generate transcription files in the UTF-8 encoding, by modifying /tools/xml2txt.xsl.
  • Generate romanized phonetic transcription files, by modifying /tools/xml2syl.xsl.

The next release of UUDB will supply utilities for non-Japanese environment support.
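As an alternative that the manual itself does not describe, an already generated Shift JIS transcription file can also be re-encoded without an XSLT processor. The sketch below uses Python's built-in "shift_jis" codec; the function names are hypothetical, and if a file was written with the Windows variant of the encoding, the "cp932" codec may be needed instead.

```python
def sjis_bytes_to_utf8(data: bytes, source_encoding: str = "shift_jis") -> bytes:
    """Re-encode Shift JIS bytes as UTF-8 bytes."""
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a Shift JIS file and write a UTF-8 copy of it."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(sjis_bytes_to_utf8(data))


# Round trip on an in-memory example instead of a real database file.
sample = "あいうえお".encode("shift_jis")
print(sjis_bytes_to_utf8(sample).decode("utf-8"))
```

For example, convert_file("/var/C001/C001.txt", "C001.utf8.txt") would write a UTF-8 copy of the orthographic transcription for Session C001, assuming the database is mounted at "/".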

Using XML Utilities

The directory /tools contains the files listed below.

		  /tools/SplitUtterance    Source files of the utterance splitting utility
		  /tools/SplitUtterance.sh Shell script for batch execution of the utterance splitting utility
		  /tools/xml2txt.xsl       Stylesheet for generating orthographic transcription files (EUC encoding)
		  /tools/xml2txt-sjis.xsl  Stylesheet for generating orthographic transcription files (Shift JIS encoding)
		  /tools/xml2syl.xsl       Stylesheet for generating phonetic transcription files (EUC encoding)
		  /tools/xml2syl-sjis.xsl  Stylesheet for generating phonetic transcription files (Shift JIS encoding)
		  /tools/transcription.sh  Shell script for batch execution of transcription file generation
		  /tools/xml2list.xsl      Stylesheet for generating per-utterance speech file name lists
		  /tools/list.sh           Shell script for batch execution of the per-utterance speech file name list generation
		  /tools/xml2para.xsl      Stylesheet for generating averaged rating files of paralinguistic information
		  /tools/para.sh           Shell script for batch execution of averaged rating file generation
		  /tools/uudb.rng          RELAX NG schema of the UUDB XML Document

To run the utterance splitting utility, the Java SE development environment is required.

2 Design and Building of the UU Database

The Four-Frame Cartoon Sorting Task

(This section is a stub.)

Dialogue Recording

(This section is a stub.)

Identifying the Unit of Utterance

(This section is a stub.)


(This section is a stub.)

Paralinguistic Information Annotation

(This section is a stub.)

3 Structure of the UU Database

Elements of the UUDB XML Document

In the UU Database, each individual discourse is treated as a session. An XML document of the UU Database is a single file for a session, and its root element is Session.

Attributes to Session


Session ID of the session that this XML document describes.


The date of recording.


ID of the speaker whose speech is recorded in the first channel (normally the left channel) for this session.


ID of the speaker whose speech is recorded in the second channel (normally the right channel) for this session.


The cartoon material number used for this session.


Enumerated frame numbers that the speaker with the ID LSpeakerID has. For example, if a speaker has the (originally) fourth and second frames, the attribute value is "4,2".


Enumerated frame numbers that the speaker with the ID RSpeakerID has.
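The enumerated frame numbers are comma-separated, so they split into integers directly. A minimal sketch; the function name parse_frames is hypothetical.

```python
from typing import List


def parse_frames(value: str) -> List[int]:
    """Turn an attribute value such as "4,2" into [4, 2], keeping the order."""
    return [int(part) for part in value.split(",") if part.strip()]


print(parse_frames("4,2"))
```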


True if this session was performed under a time constraint.


The time at which this session started (in seconds). Identical to the start time recorded in the header of this session's speech data in the ESPS format. All start and end times given in the XML documents (UtteranceStartTime, UtteranceEndTime, StartTime, EndTime) are measured on the same time axis as this value, so subtracting SessionStartTime gives the time elapsed since the beginning of the session. For example, if SessionStartTime is 2.0 s, UtteranceStartTime is 10.1 s and UtteranceEndTime is 11.1 s, the utterance started 8.1 s after the beginning of this session and ended 9.1 s after the beginning of the session.
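The worked example for SessionStartTime amounts to a single subtraction. The sketch below uses decimal arithmetic on the attribute text so that the result is exact; the function name is hypothetical.

```python
from decimal import Decimal


def seconds_since_session_start(time_value: str, session_start_time: str) -> Decimal:
    """Convert a time from the XML document into time elapsed in the session.

    Both arguments are attribute values as text, in seconds.
    """
    return Decimal(time_value) - Decimal(session_start_time)


# Values from the example: SessionStartTime 2.0, utterance from 10.1 to 11.1.
print(seconds_since_session_start("10.1", "2.0"))  # 8.1
print(seconds_since_session_start("11.1", "2.0"))  # 9.1
```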

SessionComment element

Comments for this session. Intended for development use only; users should not rely on this information.

Comment element

A comment.

Attributes to Comment

Comment strings.

Utterance element

An utterance. The maximum unit for description in the UU Database.

Each session is composed of a sequence of utterances. Utterances are arranged in ascending order of start time.
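The ordering of utterances can be checked with a few lines of standard-library code. This is only a sketch: the sample document below is made up for illustration, it assumes that UtteranceStartTime and UtteranceEndTime appear as attributes of Utterance under exactly those names, and real UUDB documents carry more attributes and child elements and may need namespace handling.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Made-up miniature document; not taken from the database.
SAMPLE = """<Session SessionStartTime="2.0">
  <Utterance UtteranceStartTime="10.1" UtteranceEndTime="11.1"/>
  <Utterance UtteranceStartTime="11.4" UtteranceEndTime="12.0"/>
</Session>"""


def utterance_spans(xml_text: str) -> List[Tuple[float, float]]:
    """Collect (start, end) pairs of all Utterance elements in document order."""
    root = ET.fromstring(xml_text)
    return [
        (float(u.get("UtteranceStartTime")), float(u.get("UtteranceEndTime")))
        for u in root.iter("Utterance")
    ]


def is_sorted_by_start(spans: List[Tuple[float, float]]) -> bool:
    """True if start times never decrease from one utterance to the next."""
    return all(a[0] <= b[0] for a, b in zip(spans, spans[1:]))


spans = utterance_spans(SAMPLE)
print(spans, is_sorted_by_start(spans))
```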

Attributes to Utterance


ID of this utterance. Serial number where the first utterance of each session is "001".


Indicates the speaker of this utterance. If "L", this utterance is of the speaker with the ID LSpeakerID. If "R", this utterance is of the speaker with the ID RSpeakerID.
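Resolving the speaker of an utterance is a two-way lookup against the session-level speaker IDs. A minimal sketch; the function name is hypothetical, and the assignment of FKC and FUE to the two channels in the example is illustrative only.

```python
def speaker_of_utterance(channel: str, l_speaker_id: str, r_speaker_id: str) -> str:
    """Map the "L"/"R" value of an utterance to the corresponding speaker ID."""
    if channel == "L":
        return l_speaker_id
    if channel == "R":
        return r_speaker_id
    raise ValueError(f"expected 'L' or 'R', got {channel!r}")


# Illustrative only: which speaker is on which channel is not stated here.
print(speaker_of_utterance("R", l_speaker_id="FKC", r_speaker_id="FUE"))
```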


The start time of this utterance. cf. SessionStartTime


The end time of this utterance. cf. SessionStartTime


A keyword that designates whether the end of this utterance coincides with the end of a slash unit. The default value is "complete".

If "complete", the end of this utterance is the end of a slash unit.

If "incomplete", the end of this utterance is not the end of a slash unit, and the slash unit continues into the next utterance.

If "irrelevant", this utterance is not involved in identifying slash units. This applies when the whole utterance is composed of nonlinguistic sounds or of short fragments that can hardly form a slash unit.
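The three keywords can be used to regroup utterances into slash units. The sketch below is one possible reading, not the database's own procedure: it assumes the input holds the utterances of a single speaker in order, treats "the next utterance" as the next item in that list, and simply skips "irrelevant" utterances. The function name is hypothetical.

```python
from typing import List, Tuple


def group_slash_units(utterances: List[Tuple[str, str]]) -> List[List[str]]:
    """Group (utterance_id, keyword) pairs into lists of IDs, one per slash unit."""
    units: List[List[str]] = []
    current: List[str] = []
    for utterance_id, keyword in utterances:
        if keyword == "irrelevant":
            continue  # not involved in identifying slash units
        current.append(utterance_id)
        if keyword == "complete":
            units.append(current)  # the slash unit ends with this utterance
            current = []
    if current:
        units.append(current)  # trailing unit that never reached "complete"
    return units


example = [("001", "complete"), ("002", "incomplete"),
           ("003", "complete"), ("004", "irrelevant")]
print(group_slash_units(example))
```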

UtteranceComment element

Comments for this utterance. Intended for development use only; users should not rely on this information.

Comment element

A comment.

Attributes to Comment

Comment strings.

EmotionalState element

A set of paralinguistic information annotations for this utterance.

Rating element

The perceived emotional states of the speaker that an annotator evaluated for this utterance on a 7-point scale for six abstract dimensions.

Attributes to Rating

The annotator ID.


The rating for the "pleasant-unpleasant" dimension. Evaluation of the speaker's feeling.

  1. Extremely unpleasant
  2. Very unpleasant
  3. Somewhat unpleasant
  4. Neutral
  5. Somewhat pleasant
  6. Very pleasant
  7. Extremely pleasant

The rating for the "aroused-sleepy" dimension. Evaluation of the speaker's mental activity.

  1. Extremely sleepy
  2. Very sleepy
  3. Somewhat sleepy
  4. Neutral
  5. Somewhat aroused
  6. Very aroused
  7. Extremely aroused

The rating for the "dominant-submissive" dimension. Evaluation of the degree to which the speaker leads the communication with the other party.

  1. Extremely submissive
  2. Very submissive
  3. Somewhat submissive
  4. Neutral
  5. Somewhat dominant
  6. Very dominant
  7. Extremely dominant

The rating for the "credible-doubtful" dimension. Evaluation of the degree to which the speaker believes the other party.

  1. Extremely doubtful
  2. Very doubtful
  3. Somewhat doubtful
  4. Neutral
  5. Somewhat credible
  6. Very credible
  7. Extremely credible

The rating for the "interested-indifferent" dimension. Evaluation of the degree to which the speaker is interested in the other party or her/his utterance.

  1. Extremely indifferent
  2. Very indifferent
  3. Somewhat indifferent
  4. Neutral
  5. Somewhat interested
  6. Very interested
  7. Extremely interested

The rating for the "positive-negative" dimension. Evaluation of the degree to which the speaker evaluates the other party's utterance positively.

  1. Extremely negative
  2. Very negative
  3. Somewhat negative
  4. Neutral
  5. Somewhat positive
  6. Very positive
  7. Extremely positive
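Because each annotator rates every dimension on the same 1 to 7 scale, per-utterance averages across annotators are straightforward to compute. The sketch below assumes a plain arithmetic mean; how the averaged rating files under /var are actually produced is not specified here. The dictionary keys are hypothetical labels, not the attribute names used in the XML documents.

```python
from statistics import mean
from typing import Dict, List


def average_ratings(ratings: List[Dict[str, int]]) -> Dict[str, float]:
    """Average each dimension over all annotators' ratings of one utterance."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for rating in ratings:
        for dimension, score in rating.items():
            if not 1 <= score <= 7:
                raise ValueError(f"{dimension} score {score} is outside 1..7")
    return {dim: mean(r[dim] for r in ratings) for dim in ratings[0]}


# Three hypothetical annotators, two of the six dimensions shown.
annotations = [
    {"pleasantness": 4, "arousal": 6},
    {"pleasantness": 6, "arousal": 5},
    {"pleasantness": 5, "arousal": 7},
]
print(average_ratings(annotations))
```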

Child elements of Utterance

Each utterance is a sequence of constituents, each of which is a "nonlinguistic sound", a "short pause" or a "chunk". These constituents are contiguous and do not overlap. Therefore, the speech duration of an utterance can be calculated by subtracting the total duration of the nonlinguistic sounds and short pauses within the utterance from the duration of the utterance (= UtteranceEndTime - UtteranceStartTime).
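The speech-duration calculation described above can be sketched as follows. The function name and the representation of nonlinguistic sounds and short pauses as (start, end) pairs are assumptions made for illustration; the times in the example are invented.

```python
from typing import Iterable, Tuple


def speech_duration(utterance_start: float,
                    utterance_end: float,
                    non_speech_spans: Iterable[Tuple[float, float]]) -> float:
    """Utterance duration minus the nonlinguistic sounds and short pauses in it.

    All times are in seconds on the same time axis.
    """
    utterance_length = utterance_end - utterance_start
    non_speech = sum(end - start for start, end in non_speech_spans)
    return utterance_length - non_speech


# A 2.0 s utterance containing a 0.25 s pause and a 0.25 s breath.
print(speech_duration(10.0, 12.0, [(10.5, 10.75), (11.0, 11.25)]))  # 1.5
```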

NonLinguisticSound element

Nonlinguistic sound. Originated from the speaker. Does not occur simultaneously with speech sound.

Unlike speech sounds (chunks), identification of nonlinguistic sound is not comprehensive.

Attributes to NonLinguisticSound


Constituent ID of this nonlinguistic sound. Serial number where the first constituent of each utterance is "1".


The start time of this nonlinguistic sound. cf. SessionStartTime


The end time of this nonlinguistic sound. cf. SessionStartTime


True if this nonlinguistic sound is a breathing sound.


True if this nonlinguistic sound is a laughing sound.

Applies to pure laughing sound only; speech portions accompanied by laughter do not count.


True if this nonlinguistic sound is a sigh.


True if this nonlinguistic sound is a cough or throat clearing.

ShortPause element

Short pause within utterance.

Attributes to ShortPause


Constituent ID of this short pause. Serial number where the first constituent of each utterance is "1".

The start time of this short pause. cf. SessionStartTime


The end time of this short pause. cf. SessionStartTime

Chunk element

A stretch of speech sound: a continuous span of speech that is not divided by nonlinguistic sounds, short pauses, or element boundaries.

Attributes to Chunk


Constituent ID of this chunk. Serial number where the first constituent of each utterance is "1".


Orthographic transcription for this chunk.


Phonetic transcription for this chunk in katakana, which can be transformed directly into a phoneme sequence.


True if this chunk is a slip of the tongue (mispronunciation) or a repetition. Some disfluent chunks are followed by a self repair; others are not.


True if this chunk is a filler.


True if this chunk is a backchannel (aizuchi).


True if this chunk is a conjunction.


True if this chunk is a discourse marker.


(experimental) True if this chunk is marked as S (something like shout or surprise).


True if the slash unit ends prematurely at this chunk.

Mora element

A mora. A chunk is composed of a sequence of morae.

Attributes to Mora

ID of this mora. Serial number where the first mora of each chunk is "1".


A symbol of this mora. Written in katakana, just the same way as PhoneticTranscription.

ExternalNoise element

A set of external noises that occurred during this utterance.

Noise element

An external noise. (The annotation is neither comprehensive nor objective.)

Attributes to ExternalNoise

ID of this external noise. Serial number where the first external noise that appears in each utterance is "1".


The start time of this external noise. cf. SessionStartTime


The end time of this external noise. cf. SessionStartTime

Structure of the speaker information file

The speaker information file is an XML document, whose root element is Speakers.

Speaker element

A speaker.

SpeakerInfo element

Personal information for this speaker.

Attributes to SpeakerInfo

Speaker ID of this speaker.


This speaker's age as of the recording date.


Gender of this speaker. "F" means female. "M" means male.

ResidentialHistory element

The residential history of this speaker, in reverse chronological order.

Sessions element

A set of sessions in which this speaker participated.

SessionInfo element

Information about a session.

Attributes to SessionInfo

Session ID of this session.


Indicates the channel in which this speaker's speech is recorded for this session.


The speaker ID of the speaker with whom this speaker is talking.