Open Password – Wednesday April 20, 2022
#1056
FAZ archive – Archiving of podcasts – Marketing of podcasts – Birgitta Fella – Jens Peter Kutz – Hans Peter Trötscher – Caspar Dawo – Frankfurter Allgemeine Zeitung – FAZ.net – mp3 – aac – vorbis – opus – Content indexing – Workflow – Indexate – Teaser – Production metadata – Automation – Transcriptions – Dragon – Trint – Tool-e-Byte – Machine learning – Software efficiency – Data congruence – TRIP databases – ClassifyServer – d.velop classification consulting – Probabilistic approach
Digital legacy – Data collection mania – Unstructured data – Costs – Ecological footprint – Data protection – Gregor Bieler – Aparavi – Current data – Aggregated metadata – Transparency – GDPR – Data cleansing – Automated solutions – Cloud providers – Data storage – Data centers – YouGov
I.
Title
FAZ archive:
Workshop report “Archiving and marketing of audio data” – By Birgitta Fella, Jens Peter Kutz, Hans Peter Trötscher and Caspar Dawo
II.
Digital legacy:
collecting unstructured data causes problems with costs, ecological footprint and data protection – By Gregor Bieler
FAZ archive
Workshop report “Archiving and marketing audio data”
Part One
By Birgitta Fella, Jens Peter Kutz, Hans Peter Trötscher and Caspar Dawo
_____________________________________________________
Initial situation and task
_____________________________________________________
As with most other media companies, the Frankfurter Allgemeine Zeitung’s website consists of text and image material as well as multimedia content. In recent years, podcasts in particular have become increasingly important.
The multimedia section on FAZ.Net is currently home to ten podcasts, including:
- Podcast for Germany
- Objection
- To know
- Digitech
- AI
- Finance & Real Estate
- Health
- Books
- How do I explain this to my child?
The publication frequency of the podcasts ranges from daily (Podcast for Germany) to weekly (Objection) to irregular releases several weeks apart (AI). Since January 20, 2020, more than 270 hours of audio material have been produced.
In form and subject matter, the podcasts resemble sophisticated radio shows. There are contributions by individual editors, but also formats with several speakers, such as interviews and discussions. Some podcast editions (episodes) are dedicated to a single topic, while others, like a magazine, cover several topics in one issue.
You can find an overview of all podcasts from the Frankfurter Allgemeine Zeitung at https://www.faz.net/podcasts/. For each episode there is an accompanying text as well as the individual episode to listen to or download. All podcasts can be subscribed to free of charge.
The following data is available for each episode:
- Title
- Publication date
- Accompanying text to the audio file
- Audio in mp3, aac, vorbis and opus formats
- Duration
- In some cases original articles, further links, etc.
This data allows archiving with formal information, but no content indexing. For that, the spoken language must be converted into machine-readable text (see below). With the production of the podcasts comes the FAZ archive's obligation both to archive this material in an appropriate form and to monetize it through secondary use. Our task was therefore to develop a workflow that prepares the data material in such a way that this obligation can be met.
During the production of the podcasts, some descriptive data is created that the producers use to supplement their mp3 audio files. However, this data primarily serves to present the audio files on the website. A detailed description of the content, keywords or even an index are not included. Apart from the formal information – for example the series affiliation with the serial number, the playing time and the source – this data contains only a short teaser as content information for listeners and subscribers.
Nevertheless, as can be easily seen from the finished product, the production metadata is an essential part of the overall resulting data set.
The first part of the project task was to automatically generate, with as little effort as possible, a text suitable for keywording and retrieval from the audio material, and to combine it with the production metadata into a data set. The first recipient of the podcast data was Presse Monitor GmbH (PMG), which was to make this data available with a link to the original mp3 file for marketing purposes (e.g. in electronic press reviews). This marketing channel defined XML as the delivery format for the documents.
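The combination of transcript and production metadata into one delivery data set can be illustrated with a minimal Python sketch. All element and field names here are assumptions for illustration only, since the actual PMG XML schema is not reproduced in this report:

```python
import xml.etree.ElementTree as ET

def build_delivery_record(metadata: dict, transcript: str) -> ET.Element:
    """Combine production metadata and the generated transcript into one XML record."""
    record = ET.Element("record")
    # Formal production metadata (hypothetical element names).
    for field in ("title", "series", "publication_date", "duration", "audio_url"):
        ET.SubElement(record, field).text = str(metadata.get(field, ""))
    # Teaser text plus the automatically generated transcript for content indexing.
    ET.SubElement(record, "teaser").text = metadata.get("teaser", "")
    ET.SubElement(record, "transcript").text = transcript
    return record

episode = {
    "title": "Example episode",
    "series": "Podcast for Germany",
    "publication_date": "2021-11-02",
    "duration": "00:27:14",
    "audio_url": "https://example.org/episode.mp3",
    "teaser": "Short teaser text for listeners.",
}
xml_text = ET.tostring(build_delivery_record(episode, "Full transcribed text ..."),
                       encoding="unicode")
```

The transcript travels in the same record as the formal metadata, so one XML file per episode carries everything a downstream recipient needs.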
FAZ podcast: How do I explain it to my child?
_____________________________________________________
Task 1: Transcribing the mp3 files.
_____________________________________________________
In order to find a suitable tool for transcribing the audio data, three candidates were tested:
- Dragon, an installable software from Nuance Communications Inc., which is primarily used to automatically produce documents from dictation, for example in law firms.
- Trint, a cloud solution that automatically generates texts from audio files and links the two together.
- Tool-e-Byte, a provider of conversion services based in Griesheim and India which, in addition to conversion and linking, can generate an XML file that meets the requirements from the converted text and the production metadata.
To compare the tools, 15 podcast files were transcribed with each tool. As a reference, another five audio book mp3s were transcribed, which were based on a written text and recorded by professional speakers using state-of-the-art studio technology.
Dragon already failed at this hurdle because practically no proper names, foreign words or quotations were transcribed correctly. The syntax and the computed punctuation marks were also completely arbitrary and incomprehensible. The software efficiency (i.e. the proportion of correctly transcribed text) for the podcasts used in the test was around 50 percent (Dragon), around 55 percent (Trint) and around 75 percent (Tool-e-Byte). Using applied machine learning, Tool-e-Byte achieves an efficiency of over 80 percent on average across all speakers. Depending on the speaker, these values can vary up or down by up to ten percent.
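The efficiency measure used here – the proportion of correctly transcribed text – can be approximated at word level by aligning the tool output against a reference transcript. A minimal sketch, assuming such a reference is available (the report does not specify the exact scoring method):

```python
from difflib import SequenceMatcher

def transcription_efficiency(reference: str, hypothesis: str) -> float:
    """Proportion of reference words reproduced correctly, via word-level alignment."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        return 0.0
    matcher = SequenceMatcher(None, ref_words, hyp_words)
    # Sum the lengths of all aligned word runs shared by reference and hypothesis.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)
```

For example, a hypothesis that gets three of four reference words right scores 0.75; a value of around 0.5 corresponds to the roughly 50 percent Dragon achieved in the test.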
Trint's error rate was somewhat better than Dragon's, but still not convincing. In the end, both tools produced almost identical results in the automatic indexing.
Even in the fully automatic basic version, Tool-e-Byte showed significantly better results than the other tools. In the case of the audio book texts, the transcribed version corresponded almost completely to the source text. The functional test and the existing solution for the production route ultimately tipped the scales in favor of this offer.
The finished XML file is delivered to PMG and the FAZ in a package with the mp3 audio file within a period of twice the podcast's playing time. There the data is transferred into the respective database structures, marketed, and made available for internal use.
The smooth transfer of podcast data for further use requires congruent data. To achieve this, the metadata provided for the podcasts first had to be assigned to the existing data fields of the XML output format. A minor complication: the PMG XML format has so far been optimized for article data sets and does not provide fields typical of audio formats. That is why, for example, the duration is delivered as a page number and the podcast series as a section. When generating the XML data set, Tool-e-Byte must adhere meticulously to the PMG conventions for source, file name and publication date and convert the source data of the podcasts accordingly.
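The repurposing of article fields described above amounts to a simple key mapping. A sketch, with hypothetical field names since the actual PMG schema is not spelled out in this report:

```python
# Hypothetical field names; the mapping reflects the repurposing described above.
PMG_FIELD_MAP = {
    "duration": "page_number",  # playing time is delivered in the page-number field
    "series": "section",        # the podcast series is delivered as the section
}

def to_pmg_fields(podcast_meta: dict) -> dict:
    """Rename podcast metadata keys to their repurposed PMG counterparts."""
    return {PMG_FIELD_MAP.get(key, key): value for key, value in podcast_meta.items()}
```

Fields without an article-format counterpart, such as the title, pass through unchanged; only the audio-specific ones are rerouted into the article-oriented slots.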
After a week of testing, production went into regular operation at the beginning of November 2021 and has been supplying PMG with the data from the current podcasts ever since.
The FAZ archive also receives the current podcast data as well as all transcribed back issues, retroactively to January 2020, for archiving.
At the FAZ, the first internal use was to enrich selected FAZ.Net podcast pages with the associated transcriptions. Within a very short time, a clearly positive effect on reach was noticeable: the search engine spiders respond much better to the transcriptions than to the teaser texts and metadata used exclusively before.
The FAZ's internal database also did not previously provide for the archiving of audio data. A separate, new database was therefore created for the podcasts, one that also takes their special features into account. Based on the PMG-XML, which Tool-e-Byte transfers to the FAZ archive via FTP, the specifications for the podcast database were defined: ID and file name, inclusion of the runtime in the source information, short description, web link to the original podcast, link to the mp3 audio file, and podcast as a new article type. A new output format is being developed for further marketing of the podcast data.
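The field list for the podcast database can be summarized in a small sketch; the attribute names are assumptions derived from the specifications listed above, not the archive's actual schema:

```python
from dataclasses import dataclass

@dataclass
class PodcastRecord:
    """One archived podcast episode, following the specification list above."""
    record_id: str          # ID
    file_name: str
    source: str             # source information, including the runtime
    short_description: str
    web_link: str           # link to the original podcast page
    audio_link: str         # link to the mp3 audio file
    article_type: str = "podcast"   # podcast as a new article type
```

Defaulting the article type to "podcast" mirrors the decision to introduce it as a new record category alongside ordinary articles.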
The use of the podcasts in the FAZ archive and in further content marketing requires another step in data improvement: the automatic indexing of this new source.
Digital legacy
Gathering unstructured data brings problems with costs, ecological footprint and data protection
By Gregor Bieler, CEO EMEA at Aparavi***
Companies are accumulating more and more digital legacy assets. Will that become a problem? Below is my reality check based on surveys commissioned by Aparavi:
Check 1: Have companies become data hoarders? Anyone currently preparing the end-of-year business in e-commerce needs data, without question. A good offer at the right time can convert leads and increase the loyalty of existing customers. Current data is necessary for this – the emphasis is on current. Conversely, this means that no one will need the data collected today in the future – at least not at the depth at which it was collected. For long-term analyses or for tracking trends, aggregated metadata should usually be sufficient. Nevertheless, complete data sets are saved. This is the case not only in retail, but in all other digital – i.e. practically all – companies.
Most companies still operate according to the principle of “more is more” and are accumulating veritable digital rubbish dumps. “It doesn’t hurt, it doesn’t cost anything, if in doubt we just save everything” – statements like these can still be heard. However, this is a fallacy and may even be illegal.
Check 2: More data, more productivity? Do companies work better the more data they have available? In an Aparavi survey*, only 20 percent of study participants stated that their companies actively work with all existing data, while another third (34 percent) stated that they make use of almost all data. Conversely, this means that 46 percent of German companies do not use large parts of their data.
Among the study participants who say they have only used half or less of their data so far, 63 percent would like to use their company data more actively in the future. 43 percent said that lack of time prevented them from doing so. It is therefore important to implement intelligent solutions that create transparency in your own data jungle and help you decide which data is really valuable and worth preserving. The rest should then be deleted immediately.
Check 3: Is data protection maintained? One might think that as long as customer data is kept safe and not leaked to the outside world, everything is fine. Since the introduction of the GDPR, however, data may only be retained for as long as it is needed for the purpose for which it was originally collected. In order not to come into conflict with the regulations, companies should definitely implement a deletion concept for personal data that is based on the legal deadlines. The key word here is data cleansing. Automated solutions that can classify data as relevant or irrelevant under data protection law are a great help for companies.
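A deletion concept of the kind described can be sketched as an automated retention check. This is a minimal illustration, assuming each record carries a collection date and a legally mandated retention period; the field names and periods are hypothetical:

```python
from datetime import date, timedelta

def due_for_deletion(records: list, today: date) -> list:
    """Return personal-data records whose retention period has elapsed."""
    return [
        r for r in records
        if r["personal_data"]
        and r["collected"] + timedelta(days=r["retention_days"]) < today
    ]

# Hypothetical data inventory with per-record retention periods.
inventory = [
    {"id": 1, "personal_data": True,  "collected": date(2018, 1, 1), "retention_days": 365},
    {"id": 2, "personal_data": True,  "collected": date(2021, 6, 1), "retention_days": 3650},
    {"id": 3, "personal_data": False, "collected": date(2015, 1, 1), "retention_days": 30},
]
expired = due_for_deletion(inventory, today=date(2022, 4, 20))
```

Only record 1 is flagged: it holds personal data and its retention period has run out, while record 3 is not personal data and record 2 is still within its deadline.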
Check 4: Does collecting data cost nothing? Every day, unused data eats up resources, blocking storage space in data centers around the world. Yet storage space is not perceived as a scarce resource in the economic sense. If storage runs short, additional terabytes can be booked with the cloud provider of your choice with just a few clicks. The costs are seen as an inevitable tribute to digitalization, and the consumption of resources goes unnoticed: only 32 percent of managing directors and IT decision-makers in German companies know what data is available in their company – this was the conclusion of a study commissioned by Aparavi*.
This lack of overview also affects costs: 40 percent of participants in another study** stated that they paid up to 100,000 euros for data storage in the financial year. For 27 percent, the costs were over 100,000 euros, and one in three could not name an amount when asked.
Conclusion. The enormous electricity demand of data centers alone is becoming an immense problem for a society in the midst of the energy transition, the project of the century. Companies that do not manage to change and streamline their data strategy will face major problems in the long term, be it with regard to their ecological footprint, costs or data protection.
* YouGov conducted online interviews with 250 business owners, managers and IT decision-makers in Germany between April 23 and May 1, 2021 on behalf of Aparavi.
** YouGov conducted online interviews with 522 primary and jointly responsible IT decision-makers in Germany between September 1 and 13, 2021, on behalf of Aparavi.
***Aparavi promises to give companies full control over their unstructured data. More at https://aparavi.eu/de .
Read the final episode: Automatic indexing in the FAZ archive: Challenges for podcast indexing – procedure for preliminary tests – results and findings – implementation in productive operation
Open Password
Forum and news
for the information industry
in German-speaking countries
New editions of Open Password appear three times a week.
If you would like to subscribe to the email service free of charge, please register at www.password-online.de.
The current edition of Open Password can be accessed immediately after it appears on the web. www.password-online.de/archiv. This also applies to all previously published editions.
International Cooperation Partner:
Outsell (London)
Business Industry Information Association/BIIA (Hong Kong)