Open Password – Monday April 25, 2022
#1058
FAZ archive – Archiving of podcasts – Marketing of podcasts – Birgitta Fella – Jens Peter Kutz – Hans Peter Trötscher – Caspar Dawo – Frankfurter Allgemeine Zeitung – FAZ.net – Automatic indexing – Speech-to-text transcription – Indexing software – Samples – Test set – Indexing quality – Implementation in productive operation – Re-indexing – Quality optimization – Machine re-indexing – Search engine optimization
Outsell – Balance Between Remote and In-Office Work – Product Renewal – Price Parity – New Participants in the Buying Process – New Skills – HAW – ZBW – Application for University Professorship – Research Data and Publication Markets – Doreen Siegfried – INCONNECS – Isabell Welpe – Digital Transformation – Potential of AI for Libraries – Tamara Pianos
Titles:
I. FAZ archive – Workshop report “Archiving and marketing audio data” – Part two – By Birgitta Fella, Jens Peter Kutz, Hans Peter Trötscher and Caspar Dawo
II. Outsell: Listening to the customer – Striking a Balance Between Remote and In-Office Work – A Need for New Skills to Monitor Buyers at Product Level
III. HAW and ZBW – Last-minute application for a university professorship
IV. INCONECSS – Digital Transformation and the Potential of AI for Libraries
FAZ archive
Workshop report “Archiving and marketing audio data”
Second part
By Birgitta Fella, Jens Peter Kutz, Hans Peter Trötscher and Caspar Dawo
FAZ podcast: How do I explain it to my child?
_____________________________________________________
Task 2: Automatic indexing in the FAZ archive
_____________________________________________________
Even before podcasts were archived, the FAZ archive carried out uniform subject indexing of its article texts for external and internal purposes, covering the entire inventory held in the TRIP databases. This enables convenient research for internal users (archive, editorial team) as well as external customers (online access) and supports product creation (dossiers, CD-ROMs, etc.). Indexing has been supported by automated processes since 2003. The ClassifyServer from d.velop classification consulting GmbH is used, which generates index term suggestions for almost all indexing categories. A semi-automatic process is applied to the majority of sources: the software pre-indexes in-house and third-party full-text sources daily, after which the archive’s documentalists check and manually post-process the (relevant) articles.
A statistical procedure is used for factual topics (notations, countries, etc.). A new document is assigned an index term if, after a similarity comparison with older, already indexed documents, there is a defined statistical probability that the term is correct (probabilistic approach). The prerequisite is the largest possible inventory of documents covering many topics with a controlled index (training corpus). Entities (companies, people, etc.) are indexed using extensive name lists, which are manually “enriched” with synonyms and checked for homonyms. The vocabulary in the document to be indexed is compared with the entries in the lists (character-string matching). If an entity name (or one of its synonyms) occurs sufficiently often, or in a prominent position (headings, captions), in the document, that name is suggested as an index term.
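The two mechanisms described above can be pictured in miniature. The following Python sketch is purely illustrative and is not the ClassifyServer implementation: the training corpus, name list, thresholds and function names are all invented, and the bag-of-words cosine similarity merely stands in for whatever statistical comparison the real software performs.

```python
from collections import Counter
import math

# Invented mini "training corpus": texts with controlled topic indexes.
TRAINING = [
    ("ecb raises key interest rate inflation euro", "Monetary policy"),
    ("bundesliga match goal season coach", "Football"),
]

# Invented name list: canonical entity name plus manually collected synonyms.
NAME_LIST = {"Frankfurter Allgemeine Zeitung": ["frankfurter allgemeine", "faz"]}

def cosine(a: str, b: str) -> float:
    """Cosine similarity of two bag-of-words vectors."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def suggest_topic(doc: str, threshold: float = 0.3):
    """Probabilistic approach: adopt the topic index of the most similar
    training document, but only if the similarity clears a defined threshold."""
    text, topic = max(TRAINING, key=lambda t: cosine(doc, t[0]))
    return topic if cosine(doc, text) >= threshold else None

def suggest_entities(doc: str, min_hits: int = 2):
    """Character-string matching: suggest an entity if its name or one of
    its synonyms occurs sufficiently often in the document."""
    text = doc.lower()
    return [name for name, synonyms in NAME_LIST.items()
            if sum(text.count(s) for s in [name.lower()] + synonyms) >= min_hits]
```

A real system would additionally weight prominent positions such as headings and captions – exactly the structural information that, as noted below, transcribed podcasts largely lack.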
The quality of the automatic indexing in the FAZ archive, measured in recall and precision, is very high, especially for the daily print edition of the in-house newspaper. The prerequisite is the continuous, daily optimization of the training corpora and name lists used.
The new media category “Podcast”, in its form transcribed using the speech-to-text process, is also to be indexed in a differentiated way using the usual categories of the FAZ archive, so that media-independent research across the archive’s entire source pool becomes possible. For reasons of time and economy, the plan from the start was to index the podcasts exclusively fully automatically.
Podcast Indexing Challenges. The automatic indexing of podcasts faces some challenges specific to this category of sources:
- The transcribed podcasts have a reduced structure that differs from most print sources. In particular, (sub)headings and captions are missing. As a result, they offer less evaluable information for the indexing process, for example about word positions in the document. Added to this is the relative length of the documents (60 minutes of podcast results in around 10 A4 pages). This results in a distribution and frequency of relevant names within a document that is completely different from most print sources.
- The automatic recognition quality depends on the similarity of the podcast topics to topics from the archived and indexed article inventory. In general, the rule applies: the closer a document to be indexed is thematically to documents from the training corpus, the better the thematic classification will be. New topics are harder to identify. The same applies to special topics, topics with a strong “feuilletonistic” tone and also ambiguous topics. In the podcasts in particular, topics such as these are sometimes discussed in great detail, for which there are only – if at all – insufficient thematic counterparts in the print sources.
- The quality of the speech-to-text transcription is crucial for entity recognition: names can only be recognized if their character string in the document corresponds to a character string in one of the name lists used. Incorrect transcriptions can therefore have a negative impact on recall (e.g. “President Both” instead of “President Biden”).
- Polythematic podcasts are a particular challenge (currently FAZ Podcast for Germany, FAZ Objection and FAZ Books Podcast): since the automatic indexing process tries to find documents in the training corpus that are as similar as possible to the document to be indexed, the search usually comes up empty here, because the training corpus hardly contains documents in which the same combination of topics is discussed within a single document. As a rule, the indexing software then finds no “suitable” document and assigns no index.
- A certain multi-topic nature often arises from the conversation situation (situational digression), when the participants in the conversation “chat” a lot and many related topics are touched on in what is actually a monothematic podcast. This also makes it more difficult to assign matching reference documents from the training corpus.
- The podcasts to be indexed contain transcribed spoken language, in contrast to the documents in the training corpus, which consist exclusively of written (press) texts. This deviating word material in the form of colloquial expressions, filler words, possibly dialects or slang is largely unknown to the indexing software and could therefore pose problems (incorrect indexing or non-indexing).
- Finally, unlike most print sources, some podcasts contain specific topic sections that cannot be meaningfully indexed (introductions, transitions, news blocks or advertising introductions).
Procedure for the preliminary tests. The diverse settings offered by the ClassifyServer indexing software are extremely convenient for influencing the quality of the automatic indexing in a targeted way. However, determining the optimal settings requires extensive indexing test runs before productive indexing can begin. Evaluating test data usually involves a comparison and statistical analysis of manual (controlled and validated) indexes against the automatically assigned indexes from one and the same test run; for the new source category “Podcast”, however, no manual indexes were available for the preliminary tests. The tried-and-tested automated statistical evaluation method used in the FAZ archive could therefore not be applied. Instead, the tests were evaluated through more time-consuming intellectual review and assessment of the automatic indexes.
Against this background, several series of tests were carried out with different threshold values and various other parameters. As a starting point, the settings were chosen that had been optimized over many years for the productive indexing of the FAZ print edition. To ensure comparability, an immutable sample test set of twenty transcribed podcasts from different thematic series was assembled. The evaluation results of the various test series were recorded in a running table and prepared graphically for everyone involved in the project.
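The automated statistical evaluation referred to above amounts, in essence, to comparing two sets of index terms per document and aggregating recall and precision over the test set. A minimal sketch – the metric definitions are standard, everything else (data shapes, function names) is invented for illustration:

```python
def precision_recall(manual, automatic):
    """Compare the controlled manual index of one document with the
    automatically assigned terms for the same document."""
    manual, automatic = set(manual), set(automatic)
    hits = len(manual & automatic)
    precision = hits / len(automatic) if automatic else 1.0
    recall = hits / len(manual) if manual else 1.0
    return precision, recall

def evaluate_test_set(pairs):
    """Average precision and recall over a test set of (manual, automatic)
    index pairs, e.g. the twenty transcribed podcasts of the sample set."""
    scores = [precision_recall(m, a) for m, a in pairs]
    n = len(scores)
    return (sum(p for p, _ in scores) / n,
            sum(r for _, r in scores) / n)
```

Because no validated manual indexes existed for the podcasts, precisely this kind of comparison was impossible in the preliminary tests and had to be replaced by intellectual review.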
Results and findings. During the course of the project, the tests showed a continuous improvement in the indexing quality. This development was promoted by the simultaneous optimization of speech-to-text recognition performance (evaluation of various service providers). The quality of the automatic indexing of the podcasts achieved at the end of the last test phase can be rated as good overall. It can be compared to the – even better – productive indexing quality of the print edition of the FAZ, which serves us as a “benchmark”.
As expected, the indexing quality, particularly for the factual topics (notations), depended heavily on the specific topics of the respective podcasts. There are also differences in quality between the individual podcast series. The indexing quality remains unsatisfactory for vague, hard-to-delimit topics, for topics that invite rambling, and for conversation partners engaged in casual chat. This problem area remains a lasting challenge for the automatic indexing of podcasts.
Also to be expected was the poor indexing quality for podcasts that cover multiple topics. To achieve good indexing quality for such polythematic podcasts, the individual topic sections would have to be supplied to the indexing software as individual, separate data sets. Since the manual effort of continuously separating topics in productive operation would be too high (estimated processing time of more than ten minutes per podcast), a solution to this problem must await the development of a technical means of automated topic separation.
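No such automated separation existed at the time of the project. Purely as an illustration of what a simple heuristic might look like – expressly not a method used by the FAZ archive – one could cut a transcript wherever the vocabulary overlap between adjacent paragraphs collapses:

```python
def split_topics(paragraphs, min_overlap=0.1):
    """Cut a transcript (non-empty list of paragraphs) into candidate topic
    sections wherever the word overlap between adjacent paragraphs collapses."""
    def jaccard(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    sections, current = [], [paragraphs[0]]
    for prev, cur in zip(paragraphs, paragraphs[1:]):
        if jaccard(prev, cur) < min_overlap:
            sections.append(current)  # vocabulary break: start a new section
            current = []
        current.append(cur)
    sections.append(current)
    return sections
```

Real transcripts, with their filler words and digressions, would demand far more robust segmentation; the point is only that each resulting section could then be passed to the indexing software as a separate data set.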
In contrast, the feared negative impact of recognition errors during speech-to-text conversion turned out to be largely unfounded. On the one hand, the recognition rate is in fact very high and was continuously improved over the course of the project, so transcription errors do not represent a serious problem for the indexing of names. On the other hand, transcription errors generally have only a small influence on the indexing quality of factual topics, since the comparison with the training corpus is based on very large amounts of words, so that occasional deviant spellings have no disruptive influence.
In addition, the topic sections within the podcasts that could not be meaningfully indexed (advertising, transitions, etc.) had little influence on the indexing quality. The brevity of these sections is less significant than expected compared to the average length of individual podcast episodes. This is also to be seen as positive because an automatically controlled exclusion of such sections would currently not be technically feasible in productive operation.
Implementation in productive operation. On December 1, 2021, the FAZ archive successfully started the productive indexing of the transcribed podcasts. In principle, the same productive workflow applies to the indexing of podcasts as to all other automatically indexed sources – a workflow that the FAZ archive’s database group has largely automated (script-controlled): as soon as the transcripts are delivered by the service provider and stored in the TRIP database, the standardized steps “document export” – “document indexing” – “indexate import (into TRIP)” are carried out.
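The three standardized steps can be pictured as a single scripted pass. The sketch below is a schematic stand-in with invented names, not the FAZ archive’s actual scripts: `db` stands in for TRIP records, `indexer` for the indexing software.

```python
def run_pipeline(db, indexer):
    """One script-controlled pass: document export -> document indexing ->
    indexate import. `db` is a list of record dicts standing in for TRIP;
    `indexer` maps a document text to a list of suggested index terms."""
    exported = [doc for doc in db if not doc.get("index_terms")]  # export
    for doc in exported:
        doc["index_terms"] = indexer(doc["text"])                 # indexing
    # "import": the records are updated in place, so writing the terms back
    # to the database is implicit in this simplified sketch.
    return len(exported)
```

In the real workflow, export and import are of course separate database operations against TRIP rather than in-place dictionary updates.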
In parallel with the continuous indexing of the current, ongoing podcast material, the podcasts published before December 1, 2021 were re-indexed. This re-indexing was completed by the end of January 2022, so that the entire inventory of FAZ podcasts published since November 2017 – more than 1,500 episodes – is available for professional research using the indexing categories of the FAZ archive.
Even after the project’s end, work on podcast indexing continues: as with all automatically indexed sources, the indexing of podcasts will be continuously optimized in productive operation with the aim of further quality improvements. The indexing quality of a number of other sources has been continuously improved over recent years, and podcasts certainly still offer potential. The options – a complete re-indexing (second run) with improved procedures, or possibly a (limited) semi-automatic re-indexing, as the FAZ archive carried out, for example, on its extensive, likewise fully automatically indexed retro holdings for the period 1949-1992 – are at least worth considering, especially as daily research and marketing practice yields further insights about the podcasts.
With the development and implementation of this process, the FAZ archive is one of the forerunners and pioneers in the German press landscape in the processing, structured archiving and marketing of multimedia content. It can also be assumed that using the transcripts for search engine optimization (SEO) will significantly increase the reach of the podcasts themselves.
Outsell: Listening to the customer
Striking a Balance Between Remote
and In-Office Work
A Need for New Skills to Monitor Buyers
at Product Level
We continue to hear about a broad range of approaches to the return to the office. On one end of the spectrum, there are companies that are still staying fully remote. On the other end, those requiring a full return: five days a week. More interesting are the approaches in the middle, some of which have been quite innovative.
Companies are trying to strike a balance between remote and in-office work to try to gain the benefits of both while offsetting the potential disadvantage of an all-or-nothing policy. For example, some are asking workers to be in the office a certain number of days per week, often two or three. Others are using flexible arrangements, such as having those who work together coordinate the days they are in the office for collaboration purposes.
*
Product renewal discussions are becoming more complicated as suppliers try to make up for lost revenue. They are looking to get “paid” for the flexibility they’ve shown during the pandemic. In fact, the old trope that they are required to show regulators price parity has resurfaced.
Another topic relates to demands placed on buyers by new enterprise participants in the buying process (new organizational structures, new CDOs, etc.). Buyers are not only being asked to show market understanding and alignment with business cases — they are being asked to supply specific product analyses and comparisons as part of their new and renewal purchase justifications. Buyers are seeing that they need new skills/capabilities to monitor suppliers at the product level and communicate the findings and alternatives to new audiences.
HAW and ZBW
Last-minute application for a university professorship
Dear Mr. Bredemeier,
You are so well connected. Could you perhaps send this job advertisement through your channels?
It’s about a joint professorship between HAW and ZBW
PROFESSORSHIP FOR THE FIELD OF “RESEARCH DATA AND DIGITAL PUBLICATION MARKETS”.
The application deadline is April 28, 2022.
Thanks!
Kind regards, Doreen Siegfried,
ZBW – Leibniz Information Center for Economics .
INCONECSS
Digital Transformation and
the Potential of AI for Libraries
Dear Mr. Bredemeier,
In May we are hosting the INCONECSS conference. Although it is aimed primarily at information institutions in the business/economics context, many of the topics go beyond that focus, so some of them could also be of interest to a wider audience.
If you would like to share the event notice in Password, for example, I would be very happy.
INCONECSS – International Conference on Economics and Business Information
Free online conference, 17–19 May 2022
There will be a mixture of live content and asynchronous content that can be viewed on the conference platform by registered participants.
The keynote “The next chapter for research information: decentralized, digital” will be held by Professor Isabell Welpe (Technical University of Munich, Germany). The business economist Isabell Welpe is an expert on the digital transformation of companies and the future of leadership and work/organizational design.
https://www.inconecss.eu/keynote/
The panel discussion is on “Potential of AI for Libraries: A new level for knowledge organization?” On the panel, we will bring together experts from different backgrounds: Research, AI, Libraries, Thesaurus/Ontology.
Topics covered during the conference will include Cooperation, Open Access, Corona effects, AI and Structured Data, Research and Teaching support, Onboarding, Information Literacy, Identifying trustworthy Conferences and much more.
Conference Program : https://www.inconecss.eu/programme/
Registration : https://www.inconecss.eu/registration/
The conference is free but you need to register in order to access the platform.
Twitter hashtag: #INCONECSS
Best regards, Tamara Pianos, Head of Information Communication,
ZBW – Leibniz Information Center for Economics
Open Password
Forum and news
for the information industry
in German-speaking countries
New editions of Open Password appear three times a week.
If you would like to subscribe to the email service free of charge, please register at www.password-online.de.
The current edition of Open Password can be accessed on the web immediately after it appears at www.password-online.de/archiv. This also applies to all previously published editions.
International Cooperation Partner:
Outsell (London)
Business Industry Information Association/BIIA (Hong Kong)
Open Password Archive – Publications
DATA JOURNALISM
Handelsblatt’s Digital Reach