Fonetik 2022

fonetik.se

Welcome to the 33rd Swedish phonetics meeting, Fonetik 2022. The meeting runs from lunch to lunch, 13-15 June 2022, at KTH Speech, Music & Hearing (Tal, musik och hörsel), KTH Royal Institute of Technology, Stockholm (see "Venue" for more information).

This website gathers information about the meeting. The content is updated continuously in the run-up to the meeting and for some time afterwards; after that we freeze the content but keep the website as an archive.

As usual, the meeting has invited contributions from all areas related to phonetics and speech, for example phonology, linguistics, computational linguistics, speech therapy, voice research, music and song, discourse analysis and conversation analysis, psychology, cognitive science, speech and language technology, signal processing, and machine learning. In line with the meeting's tradition, we have especially encouraged contributions from students and researchers with a connection to phonetics who are not yet familiar with the Swedish phonetics meeting and its associated network of researchers.

The contributions at the phonetics meeting concern Swedish, the other Nordic languages, and languages from further afield. The meeting has traditionally been visited regularly by researchers from neighbouring countries, and not infrequently by participants from further away. Since the mid-2010s, the language for presentations and plenary discussions has been English, which gives both visitors and doctoral students from other language environments better opportunities to participate.

Schedule


11:00

Lunch

11:50 - 13:00

The lunch is drop-in, for those who wish to join, at the restaurant Östra Station, close to the KTH subway station. We have not booked tables, but there should be room. The cost of the lunch is not included in Fonetik 2022.

13:00

Welcome!

13:00 - 13:20

Emergent speech behaviours/speech articulation

13:20 - 14:20

David House

Birdsong as model for infants’ emergent speech – a brief introduction

Axel G. Ekström

Rapid movements at segment boundaries – preliminary reports on manner

Malin Svensson Lundmark

The time course of onset CV coarticulation

Tugba Lulaci, Mechtild Tronnier, Pelle Söderström, Mikael Roll

Break

14:20 - 15:00

15:00

The relation of phonetics to other fields and applications

15:00 - 16:00

Jens Edlund

Sofia Strömbergsson

Speech and Language Pathology, KI

Christine Ericsdotter Nordgren

Språkstudion, SU

Zofia Malisz

Speech, Music & Hearing, KTH

Morgan Fredriksson

Nagoon/Liquid Media

16:00

Break

16:00 - 16:20

History of phonetics and speech science

16:20 - 18:00

Joakim Gustafson

Gunnar Fant

Johan Malmstedt

Hypotheses should better be well-founded and not just testable

Hartmut Traunmüller

17:00 - 18:00

Another half a century in speech research

Björn Granström, Rolf Carlson

This is an extended presentation and includes elements other than pure presentation.

18:00

Reception

18:30 - 21:00

9:00

Forgotten skills: spectrogram reading

9:00 - 9:40

David House

This includes the reveal of the winner of the great spectrogram reading contest!

Coffee

09:40 - 10:10

Voices of humans and machines

10:10 - 11:30

Mikael Roll

The prosody of surprise questions and exclamations as compared to information-seeking questions in Estonian

Eva-Liina Asu

Creaky voice in South Swedish accent 1

Anna Hjortdal

Deep learning for phonetically meaningful speech manipulation

Gustavo Teodoro Döhler Beck, Ulme Wennberg, Zofia Malisz, Gustav Eje Henter

The voice-mapping system FonaDyn – overview and demo

Sten Ternström

This is an extended presentation and includes elements other than pure presentation.

Walk to lunch

Lunch

11:50 - 13:00

13:00

Speech production by humans and machines

13:00 - 14:20

Petra Bodén

Spell new sounds with new letters. A study of how Swedish L2 learners’ spelling is affected by their L1

Cajsa Fransson, Malin Svensson Lundmark

Sardin: speech-oriented text processing

Christina Tånnander, Jens Edlund

Learning fast with fewer data samples using Neural HMMs

Shivam Mehta, Harm Lameris, Éva Székely, Jonas Beskow, Gustav Eje Henter

Spontaneous neural HMM TTS with prosodic feature modification

Harm Lameris, Shivam Mehta, Gustav Eje Henter, Ambika Kirkland, Birger Moëll, Jim O’Regan, Joakim Gustafson, Éva Székely

Break

14:20 - 15:00

15:00

Perception of human and machine speech

15:00 - 16:00

Mechtild Tronnier

Phonetic and phonological variation in vowel discrimination performance: effect of Swedish vowel categories and dialects

Renata Kochančikaitė, Mikael Roll

Mapping specific characteristics of spoken text to listener ratings

Christina Tånnander, Jens Edlund

Formants in text-to-speech systems - comparing TTS voices of Blizzard Challenge 2013

Ayushi Pandey, Sébastien Le Maguer, Julie Carson-Berndsen, Naomi Harte

16:00

Break

16:00 - 16:20

Corpora, models and tools 1

16:20 - 17:40

Jonas Beskow

Feature selection for labelling of whispered speech in ASMR recordings using Edyson

Pablo Pérez Zarazaga, Zofia Malisz

The Visible Speech platform - a research infrastructure for secure analysis of speech recordings

Fredrik Karlsson

NB! New slot to make room for another change. Thanks Fredrik!

Speech data augmentation for improving phoneme transcriptions of aphasic speech for the PSST challenge

Birger Moëll, Jim O’Regan, Shivam Mehta, Ambika Kirkland, Harm Laméris, Joakim Gustafson, Jonas Beskow

Hearing voices at the National Library - a speech corpus and acoustic model for the Swedish language

Martin Malmsten, Chris Haffenden, Love Börjeson

NB! New slot due to a scheduling conflict on the authors' side.

18:00

Dinner

18:30 - 23:00

9:00

The commercial voice - a dying breed?

9:00 - 9:40

Christina Tånnander

Martin Forsström

This is a keynote and spans the entire session.

Coffee

09:40 - 10:10

Corpora, models and tools 2

10:10 - 10:50

Zofia Malisz

Vocal activity detection and speaker diarization in speech databases: a feasibility study

Fredrik Karlsson

Continued finetuning as single speaker adaptation

Jim O'Regan

Turn-taking

10:50 - 11:30

Björn Granström

Perception of F0 movements towards potential turn boundaries in German and Swedish conversation: background and methods for an eye-tracking study

Martina Rossi, Kathrin Feindt, Margaret Zellers

The influence of prosody on turn-taking models at syntactically ambiguous places

Erik Ekstedt, Gabriel Skantze

Goodbye!

11:30 - 11:50

Lunch

11:50 - 13:00

The cost of lunches is not included in Fonetik 2022. The lunches are drop-in: we have not booked tables, but there should be enough room. The Monday lunch is at the restaurant Östra Station, close to the KTH subway station. The Tuesday and Wednesday lunches are at Harpaviljongen in Lill-Jansskogen.

14:00

TTS eval seminar

14:30 - 16:30

Invitation Registration

Proceedings


These are preprints. The official proceedings will be published in TMH-QPSR 3/2022 after the conference.

Birdsong as model for infants’ emergent speech – a brief introduction

Author	Axel G Ekström
Abstract	Songbirds have long and widely been considered a model species for the development of human speech capacities. Modelling efforts are dependent on parallels and similarities between emergent song and speech behavior. The present text describes eight such parallels, including, among others, neural lateralization, critical periods of development, and a dependency on auditory and perceptual feedback for normal development. The text takes as its unit of comparison patterns of speech observed in developing infants and patterns of song observed in juvenile songbirds, and serves both as a general summary of classic and contemporary research on the two phenomena and as a brief introduction to the topic.
Date	13-15 June 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	4
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Continued finetuning as single speaker adaptation

Author	Jim O’Regan
Abstract	The adaptation of unsupervised learning techniques to speech recognition has enabled the training of accurate models with less labelled training data, by finetuning a supervised classifier on top of a network pretrained using self-supervised methods. In this paper, we investigate whether continuing the fine-tuning of such a model is suitable as a method of speaker adaptation for a single speaker, considering two kinds of user: the casual user, with data measurable in minutes, and the professional user, with data measurable in hours. We conduct experiments across a range of dataset sizes, in an attempt to provide a basis for estimates of how much data would be needed.
Date	13-15 June 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	4
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Creaky voice in South Swedish accent 1

Author	Anna Hjortdal
Abstract	Pitch and voice quality are increasingly understood as closely intertwined. While Swedish and Norwegian word accents have traditionally been understood in terms of pitch, Danish stød, which is systematically related when it comes to distribution and function, has been described as a type of creaky voice. According to the Laryngeal Articulator Model (LAM), both pitch lowering and creaky or harsh voice can be the acoustic outcomes of tightening the laryngeal constrictor mechanism. Laryngeal constriction has been proposed as the articulatory gesture behind word accents and stød. The present study investigated creaky voice in South Swedish word accents. Harmonics-to-noise ratio was significantly lower and jitter significantly higher in accent 1 compared to accent 2 stressed vowels. Further, jitter and shimmer were higher and spectral tilt was lower in sonorant consonants following stressed vowels. The results suggest that prototypical creaky voice is another cue to accent 1 in South Swedish and are in line with proposals that pitch falls in word accents correspond to laryngeal constriction.
Date	13-15 June 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	6
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)
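The abstract above reports jitter as one of its voice quality measures. As a rough illustration only, the sketch below computes one common variant, local jitter: the mean absolute difference between consecutive glottal periods divided by the mean period. The abstract does not say which jitter definition or tool the study used, so the function name `local_jitter` and the choice of variant are assumptions made for this example.

```python
def local_jitter(periods):
    """Local jitter of a sequence of glottal period durations (in seconds).

    Assumed definition (not taken from the paper): mean absolute difference
    between consecutive periods, divided by the mean period.
    """
    if len(periods) < 2:
        raise ValueError("need at least two periods")
    # Absolute cycle-to-cycle differences in period duration.
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return mean_diff / mean_period


# A perfectly regular voice has zero jitter; alternating periods do not.
regular = [0.010, 0.010, 0.010, 0.010]
irregular = [0.010, 0.012, 0.010, 0.012]
print(local_jitter(regular), local_jitter(irregular))
```

Higher values indicate less regular vocal fold vibration, which is the direction the abstract reports for accent 1.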

Deep learning for phonetically meaningful speech manipulation

Author	Gustavo Teodoro Döhler Beck
Author	Ulme Wennberg
Author	Zofia Malisz
Author	Gustav Eje Henter
Abstract	The quality of synthetic speech has advanced rapidly in the last decade. Unfortunately, the new technologies have rarely proven to be useful for the speech sciences community. The modern methods lack direct and accurate control over important speech properties such as formants, which is necessary for stimulus creation in the speech sciences. Consequently, stimulus creation currently still relies on legacy methods that are typically based on task-specific signal processing. As a result, using manipulated stimuli with audible signal processing artefacts may lead to research findings that will not generalise to human perception of natural, artefact-free speech.
Date	13-15 June 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	4
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Feature selection for labelling of whispered speech in ASMR recordings using Edyson

Author	Pablo Pérez Zarazaga
Author	Zofia Malisz
Abstract	Whispered speech is a challenging area for traditional speech processing algorithms, as its properties differ from phonated speech and whispered data is not as easily available. A great amount of whispered speech recordings, however, can be found in the increasingly popular genre of ASMR on streaming platforms like YouTube or Twitch. Whispered speech is used in this genre as a trigger to cause a relaxing sensation in the listener. Accurately separating whispered speech segments from other auditory triggers would provide a wide variety of whispered data that could prove useful in improving the performance of data-driven speech processing methods. We use Edyson as a labelling tool, with which a user can rapidly assign labels to long segments of audio using an interactive graphical interface. In this paper, we propose features that can improve the performance of Edyson with whispered speech and we analyse parameter configurations for different types of sounds. We find Edyson a useful tool for initial labelling of audio data extracted from ASMR recordings that can then be used in more complex models. Our proposed modifications provide better sensitivity to whispered speech, thus improving the performance of Edyson in the labelling of whispered segments.
Date	13-15 June 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	2
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Formants in text-to-speech systems - comparing TTS voices of Blizzard Challenge 2013

Author	Ayushi Pandey
Author	Sébastien Le Maguer
Author	Julie Carson-Berndsen
Author	Naomi Harte
Author	Sigmedia Lab
Abstract	Modern trends in synthesis evaluation attempt to capture finer aspects of the human experience of synthetic speech. However, a feature-based exploration of the synthetic speech signal, especially in comparison with the human speech signal, is still missing from the discussion.
Date	13-15 June 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	6
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Hearing voices at the National Library - a speech corpus and acoustic model for the Swedish language

Author	Martin Malmsten
Author	Chris Haffenden
Author	Love Börjeson
Abstract	This paper details our work in developing new acoustic models for automated speech recognition (ASR) at KBLab, the infrastructure for data-driven research at the National Library of Sweden (KB). We evaluate different approaches for a viable speech-to-text pipeline for audio-visual resources in Swedish, using the wav2vec 2.0 architecture in combination with speech corpuses created from KB’s collections. These approaches include pre-training an acoustic model for Swedish from the ground up, and fine-tuning existing monolingual and multilingual models. The collections-based corpuses we use have been sampled from millions of hours of speech, with a conscious attempt to balance regional dialects to produce a more representative, and thus more democratic, model. The acoustic model this enabled, “VoxRex”, outperforms existing models for Swedish ASR. We also evaluate combining this model with various pre-trained language models, which further enhanced performance. We conclude by highlighting the potential of such technology for cultural heritage institutions with vast collections of previously unlabelled audio-visual data.
Date	13-15 June 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	5
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Hypotheses should better be well-founded and not just testable

Author	Hartmut Traunmüller
Abstract	This is a contribution about the scientific method – about falsificationism and its non-applicability in phonetics and the life sciences in general – about the advantage of distinguishing between well-founded, provisional and fictitious hypotheses and the a priori confidence we can have in these types, marginally also about the principle of parsimony – and about path-dependence and the lock-in effect of “normal science”.
Date	13-15 June 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	4
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Learning fast with fewer data samples using neural HMMs

Author	Shivam Mehta
Author	Harm Lameris
Author	Éva Székely
Author	Jonas Beskow
Author	Gustav Eje Henter
Abstract	The neural TTS paradigm synthesises significantly better-quality speech than the previous paradigm of HMM-based statistical parametric speech synthesis (SPSS). However, it requires a large amount of time and a larger corpus to learn the alignments between text and speech because of the underlying non-monotonic attention mechanism. This paper presents the benefits of merging a neural TTS system with a Hidden Markov Model (HMM), thus mixing these two paradigms and getting the best of both worlds. We replace the underlying attention mechanism in a neural TTS with an autoregressive left-to-right no-skip HMM defined by a neural network. This results in a system which learns to speak 10 times faster, requires fewer training samples, does not break down into gibberish, is smaller in size, is fully probabilistic, and allows easy control over the speaking rate without compromising the naturalness of the audio.
Date	13-15 June 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	3
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Perception of F0 movements towards potential turn boundaries in German and Swedish conversation: background and methods for an eye-tracking study

Author	Martina Rossi
Author	Kathrin Feindt
Author	Margaret Zellers
Abstract	Understanding the turn-taking system in conversation entails not only knowledge about the linguistic-structural and phonetic before Potential Turn Boundaries (PTBs), but crucially, the precise location of the transition space as well. To investigate the time domain and the phonological domain, we compare production and perception of turn ends in two related languages: German and Swedish. For the first part, we extracted pitch values at seven time points before PTBs from spontaneous speech produced in two-party conversations. The aim was to investigate the possible presence of specific patterns of variation that lead to either speaker change, floor keeping or backchannels. As no such patterns have emerged, for the second part, eye-tracking will be used to investigate the exact time point at which the ending of a turn can be projected by a listener and which acoustic signals are important for this prediction.
Date	13-15 June 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	4
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Phonetic and Phonological Variation in Vowel Discrimination Performance: Effect of Swedish Vowel Categories and Dialects

Author	Renata Kochančikaitė
Author	Mikael Roll
Abstract	Acoustic discrimination of speech sounds is affected by various factors, ranging from more universal acoustic properties of categories to the phoneme systems of the native language and dialect, and even influences from languages learned later in life. A discrimination experiment containing East Central Swedish vowels was carried out with 30 native Swedish listeners in order to explore the variation in vowel discrimination performance. Both phonetic and phonological variables have been found to have an effect on discrimination performance. Peripheral location of vowels in the F1/F2 vowel space was found to increase the discrimination performance. South Swedish dialectal area was associated with a decreased discrimination performance. Continuous exposure to foreign languages other than English was not a significant factor.
Date	13-15 June 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	5
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Rapid movements at segment boundaries – preliminary reports on manner

Author	Malin Svensson Lundmark
Abstract	This paper reports on a one-to-one relation between articulation and acoustics. It explains how segment boundaries are a result of rapid movements of the articulators. In the acceleration profile, this is identified as peak acceleration, which can be measured. A previous study found that rapid movements of an active articulator (peak acceleration) correlate with the acoustic segment boundary in bilabial and alveolar nasals ([m] and [n]). The purpose of the present paper is to extend this line of research and report some preliminary data on other manners of articulation as well ([p], [b], [l]). The results of both studies show that the one-to-one relationship between acoustics and articulation exists both across different places of articulation and across different manners of articulation.
Date	13-15 June 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	6
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Sardin: speech-oriented text processing

Author	Christina Tånnander
Author	Jens Edlund
Abstract	We present Sardin, a text processing system for Swedish TTS production that has recently undergone significant refactoring in preparation for public release and will soon be released as free and open software. Sardin is a text processing system whose goal is to prepare text for speech-centric science, such as preparing text for speech synthesis training, or for use in speech applications, for example as input of different levels and detail to different TTS systems. The current version of Sardin handles several input and output formats (EPUB, Daisy XML, generic XML, text, IPA, SAMPA), and contains modules for chunking, tokenisation, part-of-speech tagging, text normalisation, and pronunciation and prosodic information.
Date	June 13-15 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	5
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Speech data augmentation for improving phoneme transcriptions of aphasic speech for the PSST challenge

Author	Birger Moëll
Author	Jim O'Regan
Author	Shivam Mehta
Author	Ambika Kirkland
Author	Harm Laméris
Author	Joakim Gustafson
Author	Jonas Beskow
Abstract	As part of the PSST challenge, we explore how data augmentations, data sources, and model size affect phoneme transcription accuracy on speech produced by individuals with aphasia. We evaluate model performance in terms of feature error rate (FER) and phoneme error rate (PER). We find that data augmentation techniques, such as pitch shift, improve model performance. Additionally, increasing the size of the model decreases FER and PER. Our experiments also show that adding manually transcribed speech from non-aphasic speakers (TIMIT) improves performance when Room Impulse Response is used to augment the data. The best performing model combines aphasic and non-aphasic data and has a 21.0% PER and a 9.2% FER, a relative improvement of 9.8% compared to the baseline model on the primary outcome measurement. We show that data augmentation, larger model size, and additional non-aphasic data sources can be helpful in improving automatic phoneme recognition models for people with aphasia.
Date	13-15 June 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)
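The abstract above evaluates models by phoneme error rate (PER). As a hedged illustration, PER is commonly computed as the edit distance between the reference and predicted phoneme sequences divided by the reference length; the sketch below follows that common definition. The official PSST scoring may differ in detail, the feature error rate (FER) is not covered, and the function names and the ARPAbet-style example labels are assumptions made for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """PER under the assumed definition: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one phoneme of four is missing from the prediction.
reference = ["S", "P", "IY", "CH"]
predicted = ["S", "P", "CH"]
print(phoneme_error_rate(reference, predicted))
```

Because insertions count as errors, PER under this definition can exceed 1.0 when the prediction is much longer than the reference.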

Spell new sounds with new letters. A study of how Swedish L2 learners’ spelling is affected by their L1

Author	Cajsa Fransson
Author	Malin Svensson Lundmark
Abstract	The current study examines whether there is a connection between Swedish L2 students' L1 and their spelling in Swedish. The data were collected through a spelling test conducted at SFI course levels C and D. The experiment was conducted in an urban area in Småland in three groups consisting of 37 course participants with 12 different L1s. People with Arabic as their L1 are the focus of the study due to the selection. The results show that the spelling mistakes could to some extent be explained by the Arabic phoneme set. For example, consonant pairs were confused to a greater extent when one of the consonants was not in Arabic (i.e. p/b, k/g, v/f) than when both consonants in the pair are in Arabic (i.e. t/d, r/l).
Date	June 13-15 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	6
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Spontaneous neural HMM TTS with prosodic feature modification

Author	Harm Lameris
Author	Shivam Mehta
Author	Gustav Eje Henter
Author	Ambika Kirkland
Author	Birger Moëll
Author	Jim O’Regan
Author	Joakim Gustafson
Author	Éva Székely
Abstract	Spontaneous speech synthesis is a complex enterprise, as the data has large variation, as well as speech disfluencies normally omitted from read speech. These disfluencies perturb the attention mechanism present in most Text to Speech (TTS) systems. Explicit modelling of prosodic features has enabled intuitive prosody modification of synthesized speech. Most prosody-controlled TTS, however, has been trained on read-speech data that is not representative of spontaneous conversational prosody. The diversity in prosody in spontaneous speech data allows for more wide-ranging data-driven modelling of prosodic features. Additionally, prosody-controlled TTS requires extensive training data and GPU time which limits accessibility. We use neural HMM TTS as it reduces the parameter size and can achieve fast convergence with stable alignments for spontaneous speech data. We modify neural HMM TTS to enable prosodic control of the speech rate and fundamental frequency. We perform subjective evaluation of the generated speech of English and Swedish TTS models and objective evaluation for English TTS. Subjective evaluation showed a significant improvement in naturalness for Swedish for the mean prosody compared to a baseline with no prosody modification, and the objective evaluation showed greater variety in the mean of the per-utterance prosodic features.
Date	June 13-15 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	4
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

The influence of prosody on turn-taking models at syntactically ambiguous places

Author	Erik Ekstedt
Author	Gabriel Skantze
Abstract	Turn-taking is a fundamental aspect of human communication and is the ability to organize turns, between the interlocutors, at appropriate locations throughout a conversation. In this work we investigate the influence of prosody on turn-taking using the recently proposed Voice Activity Projection model, which incrementally models the upcoming speech activity of the interlocutors in a self-supervised manner, without relying on explicit modelling of prosodic features, or specific annotations of turn-taking events. Inspired by psycholinguistic experiments we focus our analysis on single utterances containing syntactically ambiguous places, specifically designed to depend on prosody. We further investigate the implicit influence of prosody on the turn-taking model through prosodic manipulation of the speech signal.
Date	June 13-15 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	7
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

The time course of onset CV coarticulation

Author	Tugba Lulaci
Author	Mechtild Tronnier
Author	Pelle Söderström
Author	Mikael Roll
Abstract	The study investigates the center of gravity in onset fricatives as a main acoustic feature to assess the relation between vowel pronunciation and coarticulatory spectral characteristics of the onset consonant. /s/- and /f/-initial CV sequences were analyzed with backness, roundedness and height of the vowel as predictors of fricative center of gravity. Results showed that the first 15 ms of an onset fricative could carry predictive cues to the upcoming vowel.
Date	June 13-15 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	4
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)
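The abstract above uses the spectral centre of gravity of onset fricatives as its main acoustic measure. As a minimal sketch, the centre of gravity can be computed as the mean frequency of a spectrum weighted by its magnitude raised to a power (2, i.e. the power spectrum, in this example). The windowing, filtering and weighting actually used in the study are not stated in the abstract, so the function name, the `power` parameter and the plain DFT below are assumptions for illustration.

```python
import cmath
import math


def spectral_centre_of_gravity(samples, sample_rate, power=2.0):
    """Weighted mean frequency (Hz) of the one-sided spectrum of `samples`.

    Uses a plain O(n^2) DFT so the sketch needs no third-party packages;
    this is fine for short excerpts such as a few milliseconds of frication.
    """
    n = len(samples)
    weighted_sum = 0.0
    total_weight = 0.0
    for k in range(n // 2 + 1):  # bins from 0 Hz up to the Nyquist frequency
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        weight = abs(coeff) ** power
        weighted_sum += (k * sample_rate / n) * weight
        total_weight += weight
    if total_weight == 0.0:
        raise ValueError("signal has no spectral energy")
    return weighted_sum / total_weight


# Sanity check on a pure 1000 Hz tone sampled at 8000 Hz (exactly 8 cycles in 64 samples).
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
print(round(spectral_centre_of_gravity(tone, 8000)))
```

For real fricative excerpts, a window function and a fast FFT implementation would normally be preferred over this direct DFT.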

The voice-mapping system FonaDyn – overview and demo

Author	Sten Ternström
Abstract	The voice is notoriously variable, and conventional measurement paradigms are weak in terms of providing evidence for effects of treatment and/or training of voices. New methods are needed that can take into account the variability of scalar metrics across the voice range. The voice map, a generalization of the phonetogram, offers a frame of reference that can be used in many ways, for research and in the clinic. FonaDyn is a proof-of-concept workbench that we are developing in order to explore and validate the mapping measurement paradigm. In this demo, you can try FonaDyn, to visualize and measure your own phonation faster and in greater detail than ever before.
Date	June 13-15 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	2
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Vocal activity detection and speaker diarization in speech databases: a feasibility study

Author	Fredrik Karlsson
Abstract	The task of creating speech corpora for phonetic research is time-consuming and could be alleviated by automatic algorithms to provide draft indexing of speech acts. The present investigation assessed the feasibility of applying speech segmentation and speaker diarization models across a collection of recordings to produce a draft indexing that could be utilised by speech management systems to help the researcher to navigate a corpus. The results show that a readily available model for speech segmentation is very likely to contribute to the effectiveness of speech annotation workflows in phonetic research. Speaker diarization models may require specific training to manage consistent speaker separation across a speech corpus, and the evaluated model currently offers no clear advantage to the effectiveness of a speech corpus creation process.
Date	June 13-15 2022
Language	en
Place	Stockholm
Publisher	KTH Royal Institute of Technology
Pages	4
Proceedings Title	Proc. of Fonetik 2022
Conference Name	Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Preprint	PDF (DOI pending)

Contact & organisation

Contact

Registration is now closed, but if you have questions about registration, you can contact us at the Fonetik 2022 email address: 2022@fonetik.se.

Organisation

The organisation committee consists of Jens Edlund, Christina Tånnander, Zofia Malisz and David House. We have had help from a number of people both within and outside TMH, but special mention goes to Lia Malm, Jim O'Regan, Ghazaleh Esfandiari, Axel Exström and above all Rolf Carlson and Björn Granström, who contributed in every conceivable manner.

Fonetikstiftelsen

Special thanks to Fonetikstiftelsen for once again helping to keep down the costs for the participants.

Venue

Schedule

Monday

11:00

13:00

Emergent speech behaviours/speech articulation

Birdsong as model for infants’ emergent speech – a brief introduction

Rapid movements at segment boundaries – preliminary reports on manner

The time course of onset CV coarticulation

15:00

The relation of phonetics to other fields and applications

16:00

History of phonetics and speech science

Gunnar Fant

Hypotheses should better be well-founded and not just testable

Another half a century in speech research

18:00

Tuesday

9:00

Forgotten skills: spectrogram reading

Voices of humans and machines

The prosody of surprise questions and exclamations as compared to information-seeking questions in Estonian

Creaky voice in South Swedish accent 1

Deep learning for phonetically meaningful speech manipulation

The voice-mapping system FonaDyn – overview and demo

13:00

Speech production by humans and machines

Spell new sounds with new letters. A study of how Swedish L2 learners’ spelling is affected by their L1

Sardin: speech-oriented text processing

Learning fast with fewer data samples using Neural HMMs

Spontaneous neural HMM TTS with prosodic feature modification

15:00

Perception of human and machine speech

Phonetic and phonological variation in vowel discrimination performance: effect of Swedish vowel categories and dialects

Mapping specific characteristics of spoken text to listener ratings

Formants in text-to-speech systems - comparing TTS voices of Blizzard Challenge 2013

16:00

Corpora, models and tools 1

Feature selection for labelling of whispered speech in ASMR recordings using Edyson

The Visible Speech platform - a research infrastructure for secure analysis of speech recordings

Speech data augmentation for improving phoneme transcriptions of aphasic speech for the PSST challenge

Hearing voices at the National Library - a speech corpus and acoustic model for the Swedish language

18:00

Wednesday

9:00

The commercial voice - a dying breed?

Corpora, models and tools 2

Vocal activity detection and speaker diarization in speech databases: a feasibility study

Continued finetuning as single speaker adaptation

Turn-taking

Perception of F0 movements towards potential turn boundaries in German and Swedish conversation: background and methods for an eye-tracking study

The influence of prosody on turn-taking models at syntactically ambiguous places

14:00

Social programme

Proceedings

Birdsong as model for infants’ emergent speech – a brief introduction

Continued finetuning as single speaker adaptation

Creaky voice in South Swedish accent

Deep learning for phonetically meaningful speech manipulation

Feature selection for labelling of whispered speech in ASMR recordings using Edyson

Formants in text-to-speech systems - comparing TTS voices of Blizzard Challenge 2013

Hearing voices at the National Library - a speech corpus and acoustic model for the Swedish language

Hypotheses should better be well-founded and not just testable

Learning fast with fewer data samples using neural HMMs

Perception of F0 movements towards potential turn boundaries in German and Swedish conversation: background and methods for an eye-tracking study

Phonetic and phonological variation in vowel discrimination performance: effect of Swedish vowel categories and dialects

Rapid movements at segment boundaries – preliminary reports on manner

Sardin: speech-oriented text processing

Speech data augmentation for improving phoneme transcriptions of aphasic speech for the PSST challenge

Spell new sounds with new letters. A study of how Swedish L2 learners’ spelling is affected by their L1

Spontaneous neural HMM TTS with prosodic feature modification

The influence of prosody on turn-taking models at syntactically ambiguous places

The time course of onset CV coarticulation

The voice-mapping system FonaDyn – overview and demo