Legislation Releases Linguistic Treasure Trove to Penn Researchers
Long blocked by law from using Voice of America and Radio Marti recordings, linguistic researchers at Penn are hoping the Cold War weapons of words will soon become a rich database for shaping the future of computing and advancing language instruction.
"There's no source of linguistic information quite like it," said Michael Lenker, managing director of Penn's Linguistic Data Consortium (LDC). "It's a treasure of material that couldn't be used until now."
Since its founding in 1992, LDC has been in the business of collecting and distributing computer readable text and voice recordings and making them available for research on human languages. The linguistic data is used by university researchers, private industry and government agencies working to develop technologies ranging from computerized language translation to better language teaching methods.
Until last September, however, researchers were forbidden access to U.S. Information Agency materials under federal laws that date from the end of World War II. The USIA, which has jurisdiction over VOA and Radio Marti, is prohibited from distributing its program materials, including transcripts, within the United States.
The aim of the prohibition is to limit the power of any administration to use the agency or its resources to influence domestic public opinion for its own purposes.
Because Congress had granted narrowly tailored exemptions to these rules in the past, the LDC began negotiations with USIA to find the common ground that would set the stage for action by Congress to open these resources to researchers.
LDC staff, especially Rebecca Finch, who spearheaded the talks with USIA, worked closely with Penn's office of Policy Planning and Federal Relations to garner the necessary support to pass the legislative exemption.
Penn turned to Rep. Benjamin Gilman (W '46), a Republican from New York, who is the chairman of the House International Relations Committee, which has jurisdiction over USIA, to take up LDC's request. Support for the exemption came from other local members as well, including Robert Andrews of New Jersey and John Fox of Pennsylvania. In addition, Sens. Arlen Specter (C '51), of Pennsylvania; Joseph Biden, of Delaware; Richard Lugar, of Indiana; and Jesse Helms, chair of the Senate Foreign Relations Committee, gave the necessary nod to make things happen.
In introducing the bill on the floor of the House, Gilman cited Penn as the host of the LDC and stated his support for the importance of the LDC's research.
Under a bill passed in the last days of the 104th Congress, the USIA was authorized to make "computer readable multilingual text and recorded speech in various languages" available to LDC upon request. The measure will expire in five years.
Technically, the bill was batched with a host of matters considered non-controversial and passed under a procedure called a suspension of the rules. The procedure allows such matters to bypass the normal lengthy committee process and come before the full House for quick action. A similar procedure occurred in the Senate.
Currently the LDC and USIA representatives are working out a schedule and technical details for making the data available to researchers.
Although LDC has collected massive amounts of linguistic information from all over the world (millions of sentences and phrases taken from everyday sources such as newspapers, novels and radio broadcasts), VOA and Radio Marti are uniquely suited to the kind of analysis LDC performs, Lenker notes. Because the identical broadcast is made in 52 different languages, computers can study the statistical relationships among the words in context.
"LDC researchers will not be reading this material for content," Lenker said. "What is important is the use of the computer to process this vast source of linguistic material." One important use is the construction of "language models" which can give computers a sense of what word belongs in what context.
For example, both "last" and "lost" are adjectives in English and both could be used to modify the noun "year." However, the phrase "last year" occurs in news-wire text more than 300 times per million words, while "lost year," although sensible, is extremely unlikely. The computer uses these context clues in much the way humans do to understand the meaning of the phrase.
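The counting idea behind such a language model can be sketched in a few lines of Python. This is only a minimal illustration, not LDC's actual software: the function names `bigram_rates` and `more_likely` are hypothetical, and the sample sentence is invented for the example rather than drawn from news-wire data.

```python
from collections import Counter


def bigram_rates(tokens):
    """Return occurrences per million words for each adjacent word pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    return {pair: count * 1_000_000 / total for pair, count in pairs.items()}


def more_likely(rates, candidates, next_word):
    """Pick the candidate that most often precedes next_word in the sample."""
    return max(candidates, key=lambda w: rates.get((w, next_word), 0.0))


# Toy sample text, invented for illustration only (not real news-wire data).
sample = (
    "sales rose last year and profits rose last year after a lost decade"
).split()

rates = bigram_rates(sample)

# "last year" appears in the sample and "lost year" does not,
# so the counts favor "last" before "year".
print(more_likely(rates, ["last", "lost"], "year"))
```

With a real corpus of millions of words, the same counts would let a speech recognizer that is unsure whether it heard "last" or "lost" before "year" choose the far more common phrase.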
Computers with the capability to understand subtleties of human language will usher in a new age of machines that will be able to transcribe human speech instantly, find specific information more precisely within large documents, and translate foreign languages with unsurpassed accuracy.
The latter capability is enormously important for foreign trade. Firms seeking to do business in global markets would benefit from technology that breaks down language barriers. In addition, Lenker notes that the computer's potential for revolutionizing the teaching of languages is just beginning to be grasped. LDC has begun several projects, most notably in Mandarin Chinese and Spanish, to make language instruction more effective. Computers have great promise for improving language "maintenance": keeping foreign language skills sharp during periods when there are few chances for use.
Under Mark Liberman's directorship, LDC, a nonprofit organization founded with an initial grant of $5 million from the Advanced Research Projects Agency (ARPA), is a broadly based consortium of 80 universities, companies and government agencies that funds activities through membership fees and data sales. LDC already has 11 continuous data streams with data being fed to its computers through satellite feed, modems and Internet newsgroups, providing text in 12 languages. LDC also collects spoken data in a similar range of languages.
"This opportunity to distribute USIA resources will be an incredible tool for teaching and researching languages," Lenker said. "The sky's the limit."