Lecture IV - speech.ee.ntu.edu.tw

AI Alchemy: Encoder, Generator, and Putting Them Together
Hung-yi Lee

Machine Learning: Looking for a Function
Binary classification: input → function f → Yes/No
Multi-class classification: input → function f → Class 1, Class 2, ..., Class N

Machine Learning: Looking for a Function
Structured input/output:

Speech recognition: f(audio) = text
Summarization: f(document) = (title, summary)
Text to image: f("girl with red hair and red eyes") = a matching image

Outline: Auto-encoder, Deep Learning, Deep Generative Model, Conditional Generation

Deep Learning in One Slide
Many kinds of networks: fully connected feedforward network (MLP), convolutional neural network (CNN), recurrent neural network (RNN). Their inputs and outputs can be vectors, matrices, or sequences. They are all functions.

How to find the function? Given example inputs/outputs as training data: {(x1, y1), (x2, y2), ..., (x1000, y1000)}.
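A minimal sketch of this "looking for a function" view, assuming PyTorch and toy data (the slides do not name a framework): a small fully connected network is fitted to example (x, y) pairs by gradient descent.

# Fit a small network f so that f(x) matches the example outputs y in the training data.
import torch
import torch.nn as nn

# toy training data: 1000 (x, y) pairs, 10-dim inputs, 3 classes (all made up)
x = torch.randn(1000, 10)
y = torch.randint(0, 3, (1000,))

f = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))  # the "function"
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(f.parameters(), lr=0.1)

for epoch in range(20):
    opt.zero_grad()
    loss = loss_fn(f(x), y)   # how far f(x) is from the example outputs
    loss.backward()           # gradient of the loss w.r.t. the parameters
    opt.step()                # update the parameters of f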

Outline: Auto-encoder, Deep Learning, Deep Generative Model, Conditional Generation

Unsupervised Learning: a digit can be represented as a 28 x 28-dim vector, but most 28 x 28-dim vectors are not digits. (Figure: many images of the digit 3 plotted along a low-dimensional axis.)

Auto-encoder: an NN Encoder maps the 28 x 28 = 784-dim input to a low-dimensional code.

The code is a compact representation of the input object. An NN Decoder, learned together with the encoder, can reconstruct the original object from the code.

Deep Auto-encoder (unsupervised learning): NN encoder + NN decoder = a deep network. The network is trained so that the output layer is as close as possible to the input layer, with the bottleneck layer in the middle serving as the code.
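A minimal sketch of such a deep auto-encoder, assuming PyTorch, 784-dim inputs, and a 32-dim code (the exact layer sizes are assumptions):

import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)           # a batch of flattened 28 x 28 images (placeholder data)
for step in range(100):
    code = encoder(x)             # compact representation of the input object
    x_rec = decoder(code)         # reconstruct the original object from the code
    loss = loss_fn(x_rec, x)      # output as close as possible to the input
    opt.zero_grad()
    loss.backward()
    opt.step()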

Reference: Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.

Deep Auto-encoder - Example: the NN Encoder compared with PCA.

(Figure: images reduced to a 32-dim code, pixels → t-SNE visualization.)

Word Embedding: the machine learns the meaning of words from reading a lot of documents, without supervision. (Figure: words such as tree, flower, run, jump, dog, rabbit, cat are placed by the NN Encoder so that related words are close together.)

To learn more: https://www.youtube.com/watch?v=X7PH3NuYW0Q

Word Embedding: the machine learns the meaning of words from reading a lot of documents without supervision. A word can be understood by its context.

Two words that appear in very similar contexts are themselves something very similar: "You shall know a word by the company it keeps."

Word Embedding characteristics: differences between word vectors capture relations, e.g. V(Rome) − V(Italy) ≈ V(Berlin) − V(Germany). Solving analogies: Rome : Italy = Berlin : ? Compute V(Berlin) − V(Rome) + V(Italy) and find the word w with the closest V(w).
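A minimal sketch of the analogy computation above, assuming we already have a dictionary of word vectors (the toy 3-dim vectors here are made up purely for illustration):

import numpy as np

V = {
    "Rome":    np.array([0.9, 0.1, 0.4]),
    "Italy":   np.array([0.8, 0.3, 0.5]),
    "Berlin":  np.array([0.2, 0.9, 0.4]),
    "Germany": np.array([0.1, 1.1, 0.5]),
    "dog":     np.array([0.5, 0.5, 0.9]),
}

# Rome : Italy = Berlin : ?   ->   target = V(Berlin) - V(Rome) + V(Italy)
target = V["Berlin"] - V["Rome"] + V["Italy"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# find the word w with V(w) closest to the target (excluding the query words)
candidates = [w for w in V if w not in ("Rome", "Italy", "Berlin")]
answer = max(candidates, key=lambda w: cosine(V[w], target))
print(answer)   # "Germany" with these toy vectors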

Word Embedding - Demo: the machine learns the meaning of words from reading a lot of documents without supervision. The model used in the demo was built as part of a project by a TA; the training data is from PTT.

Audio Word to Vector: the machine does not have any prior knowledge; it just listens to lots of audio books, like an infant. [Chung, Interspeech 16]

Audio Word to Vector is dimension reduction for sequences with variable length: word-level audio segments are mapped to fixed-length vectors, so that different utterances of the same word (e.g. "dog", "never") end up close together.

Sequence-to-sequence Auto-encoder: an audio segment (a sequence of acoustic features) is fed into an RNN Encoder; the final vector is the vector we want, and it can represent the whole audio segment.

How do we train the RNN Encoder? In the sequence-to-sequence auto-encoder, the acoustic features x1 x2 x3 x4 of an audio segment are the input, and an RNN Decoder is trained to output y1 y2 y3 y4 that reconstruct them. The RNN encoder and decoder are jointly trained.
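A minimal sketch of the sequence-to-sequence auto-encoder, assuming PyTorch and 39-dim acoustic features (the feature and hidden sizes are assumptions): the encoder's final state is the fixed-length vector, and the decoder is jointly trained to reconstruct the input frames.

import torch
import torch.nn as nn

feat_dim, hid_dim = 39, 128

encoder = nn.GRU(feat_dim, hid_dim, batch_first=True)
decoder = nn.GRU(feat_dim, hid_dim, batch_first=True)
out_proj = nn.Linear(hid_dim, feat_dim)

params = list(encoder.parameters()) + list(decoder.parameters()) + list(out_proj.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(8, 20, feat_dim)            # a batch of audio segments (placeholder data)
for step in range(50):
    _, h = encoder(x)                       # h: the fixed-length vector for each segment
    # the decoder reconstructs the sequence from the encoder's final state;
    # the true previous frames are fed as decoder inputs (teacher forcing)
    dec_in = torch.cat([torch.zeros(8, 1, feat_dim), x[:, :-1]], dim=1)
    y, _ = decoder(dec_in, h)
    loss = nn.functional.mse_loss(out_proj(y), x)   # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()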

Sequence-to-sequence Auto-encoder - visualizing the embedding vectors of words: words with similar pronunciation end up close together, e.g. fear / near and fame / name, or say / says, day / days, hand / hands, word / words.

Audio Word to Vector - Application: a user speaks a query such as "US President", and the system finds where it occurs in spoken content.

The similarity between the spoken query and the audio files is computed at the acoustic level to find the query term.

Audio Word to Vector - Application: off-line, the audio archive is divided into variable-length audio segments, and each segment is turned into a vector (audio segment to vector).

On-line, the spoken query is turned into a vector in the same way, and a similarity search over the archive vectors produces the search result. This is query-by-example spoken term detection.
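A minimal sketch of the on-line retrieval step, assuming every archive segment and the spoken query have already been turned into vectors off-line by the audio-segment-to-vector model (the vectors and segment ids below are placeholders):

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# placeholder archive: (segment id, vector) pairs produced off-line
archive = [("utt1_seg3", np.random.randn(128)),
           ("utt2_seg1", np.random.randn(128)),
           ("utt9_seg7", np.random.randn(128))]

query_vec = np.random.randn(128)    # vector of the spoken query, e.g. "US President"

# rank archive segments by acoustic similarity to the query
results = sorted(archive, key=lambda kv: cosine(kv[1], query_vec), reverse=True)
print([seg_id for seg_id, _ in results[:2]])   # top matches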

Retrieval performance (MAP vs. training epochs of the sequence auto-encoder): SA = sequence auto-encoder; DSA = de-noising sequence auto-encoder (input: clean speech + noise; output: clean speech).

Next step: can we include semantics? (Figure: walk / walked, dog / cat / cats, run, flower, tree - ideally words should cluster by meaning, not only by sound.)

Outline: Auto-encoder, Deep Learning, Deep Generative Model, Conditional Generation

Creation: drawing? Writing poems? (Image source: http://www.rb139.com/index.php?s=/Lot/44547)

Outline: Auto-encoder, Deep Learning, Deep Generative Model (component-wise, VAE, GAN), Conditional Generation

Component-by-component: images are composed of pixels; to create an image, generate one pixel at a time. E.g. for 3 x 3 images, an RNN generates each pixel conditioned on the pixels generated so far. Such a model can be trained with just a large collection of images, without any annotation.

Component-by-component - small images of 792 Pokémon. Can the machine learn to create new Pokémon? Don't catch them! Create them! (Source of images: http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_base_stats_(Generation_VI))
The original images are 40 x 40; they are shrunk to 20 x 20, and a 1-layer RNN with 512 LSTM cells is used. Real Pokémon that were never seen by the machine are covered 50% or 75%, and the model completes the rest. It is difficult to evaluate generation.

Component-by-component - drawing from scratch needs some randomness.

Component-by-component references: Audio: Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio", arXiv preprint, 2016. Video: Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, Koray Kavukcuoglu, "Video Pixel Networks", arXiv preprint, 2016.
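A minimal sketch of pixel-by-pixel generation with randomness, assuming PyTorch and a toy 3 x 3 image; the RNN here is untrained and only illustrates the sampling step that makes drawing from scratch non-deterministic.

import torch
import torch.nn as nn

n_levels = 256                       # each pixel is one of 256 intensity levels
rnn = nn.GRUCell(n_levels, 128)
to_logits = nn.Linear(128, n_levels)

h = torch.zeros(1, 128)
prev = torch.zeros(1, n_levels)      # "no pixel yet" at the start
pixels = []
for i in range(9):                   # a 3 x 3 image, one pixel at a time
    h = rnn(prev, h)
    probs = torch.softmax(to_logits(h), dim=-1)
    value = torch.multinomial(probs, 1)          # sample: the source of randomness
    pixels.append(value.item())
    prev = nn.functional.one_hot(value.squeeze(1), n_levels).float()
print(pixels)                        # 9 sampled pixel values (untrained network)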

Outline: Auto-encoder, Deep Learning, Deep Generative Model (component-wise, VAE = Variational Auto-Encoder, GAN), Conditional Generation

Remember the auto-encoder? The encoder and decoder are trained so that the output is as close as possible to the input. Can we randomly generate a vector, use it as the code, and feed it to the NN Decoder to get an image?

Remember the auto-encoder? With a 2-dim code, we can hand-pick code vectors such as [1.5, 0] or [-1.5, 0], feed each of them to the NN Decoder, and look at the image it generates.

(Figure: images produced by the decoder as the two code dimensions vary from -1.5 to 1.5.)

Auto-encoder: input → NN Encoder → code → NN Decoder → output.

VAE: as in the auto-encoder, the reconstruction error between input and output is minimized, but the code is built differently. The NN Encoder outputs (m1, m2, m3) and (σ1, σ2, σ3); noise (e1, e2, e3) is sampled from a normal distribution; the code fed to the NN Decoder is ci = exp(σi) · ei + mi. Training minimizes the reconstruction error plus Σ_{i=1..3} ( exp(σi) − (1 + σi) + (mi)² ). (Auto-Encoding Variational Bayes, https://arxiv.org/abs/1312.6114)
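A minimal sketch of this objective, assuming PyTorch, a 3-dim code, and 784-dim images (the layer sizes are assumptions): the encoder outputs (m, σ), noise e is drawn from a normal distribution, the code is exp(σ) · e + m, and the loss is the reconstruction error plus the extra term from the slide.

import torch
import torch.nn as nn

enc = nn.Linear(784, 2 * 3)          # outputs m1..m3 and sigma1..sigma3
dec = nn.Sequential(nn.Linear(3, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid())
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(64, 784)              # placeholder batch of images
for step in range(100):
    m, sigma = enc(x).chunk(2, dim=1)
    e = torch.randn_like(m)                      # e ~ N(0, I)
    c = torch.exp(sigma) * e + m                 # the code fed to the decoder
    recon = dec(c)
    rec_err = ((recon - x) ** 2).sum(dim=1)                       # reconstruction error
    reg = (torch.exp(sigma) - (1 + sigma) + m ** 2).sum(dim=1)    # extra term from the slide
    loss = (rec_err + reg).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()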

Why VAE? Intuitive reason: noise is added when producing the code, so codes near the one of a training example must also decode to something reasonable; the decoder therefore covers a region of the code space rather than isolated points.

Problems of VAE: it does not really try to simulate real images.

The code goes through the NN Decoder, and the output is only required to be as close as possible to the target. Two outputs that each differ from the target by one pixel get the same loss, yet one can look realistic and the other clearly fake.

Outline: Auto-encoder, Deep Learning, Deep Generative Model (component-wise, VAE, GAN = Generative Adversarial Network), Conditional Generation

Yann LeCun's comment: https://www.quora.com/What-are-some-recent-and-potentially-upcoming-breakthroughs-in-deep-learning

Evolution analogy (Kallima inachus, the dead-leaf butterfly; photo: http://peellden.pixnet.net/blog/post/404068992013-%E7%AC%AC%E5%9B%9B%E5%AD%A3%EF%BC%8C%E5%86%AC%E8%9D%B6%E5%AF%82%E5%AF%A5): the butterfly evolved to be brown and leaf-like because of its predators.

The predators' cues ("butterflies are not brown", then "butterflies do not have veins") push the butterflies to become brown and then to grow leaf-like veins.

The evolution of generation works the same way: NN Generator v1 is judged by Discriminator v1, the improved NN Generator v2 by Discriminator v2, and NN Generator v3 by Discriminator v3, with each discriminator comparing generated images against real images.

GAN results (source of images: https://zhuanlan.zhihu.com/p/24767059; DCGAN: https://github.com/carpedm20/DCGAN-tensorflow): samples after 100, 1,000, 2,000, 5,000, 10,000, 20,000, and 50,000 training rounds.

Basic Idea of GAN: the data we want to generate has a distribution P_data(x); in the image space, realistic images occupy a high-probability region and everything else has low probability.

Basic Idea of GAN: a generator G is a network, and the network defines a probability distribution P_G. A vector z is sampled from a normal distribution and x = G(z); we want P_G(x) to be as close as possible to P_data(x).

It is difficult to compute how close they are, because we do not know what the distributions look like. (https://blog.openai.com/generative-models/)

Basic Idea of GAN: z is drawn from a normal distribution and fed to NN Generator v1 to produce an image.

Discriminator v1 takes an image and outputs 1/0: real images are labeled 1 and images from the generator are labeled 0. It can be proved that the loss of the discriminator is related to the JS divergence between P_G and P_data.

Basic Idea of GAN - next step: update the parameters of the generator (still fed from the normal distribution) so that its output is classified as real (as close to 1 as possible), which minimizes the JS divergence. Generator + discriminator = one network; use gradient descent to update the parameters of the generator while keeping the discriminator fixed.
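A minimal sketch of this alternating procedure, assuming PyTorch and flattened 784-dim images (network sizes and learning rates are assumptions): the discriminator learns to label real images 1 and generated images 0, then the generator is updated, with the discriminator fixed, so its outputs are classified as real.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(64, 784) * 2 - 1            # placeholder batch of real images
ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)

for step in range(1000):
    # 1) update the discriminator (generator fixed)
    fake = G(torch.randn(64, 100)).detach()   # detach: do not backprop into G here
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # 2) update the generator (discriminator fixed): outputs should be classified as real
    fake = G(torch.randn(64, 100))
    g_loss = bce(D(fake), ones)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()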

The updated NN Generator v2 then fools Discriminator v1, and the process repeats with a new discriminator.

The original GAN is hard to train; W-GAN was proposed to make training easier. (Image: http://www.guokr.com/post/773890/)

Why is GAN hard to train? Think of the generator's distribution moving toward the data distribution in steps: P_G0, P_G50, P_G100 (after 0, 50, and 100 steps), with P_G100 overlapping P_data. As long as the two distributions do not overlap, the JS divergence stays at the same constant: JS(P_G0, P_data) = log 2 and JS(P_G50, P_data) = log 2, so although P_G50 is closer to P_data it is not really better as far as the training signal is concerned; only at the end does JS(P_G100, P_data) = 0.

WGAN uses the Wasserstein distance instead of the JS divergence: W(P_G0, P_data) = d0, W(P_G50, P_data) = d50 with d50 < d0, and W(P_G100, P_data) = 0, so every step toward P_data is measurably better.
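A minimal sketch of the original WGAN recipe, assuming PyTorch (the slides only motivate the Wasserstein distance; the clipping constant, RMSprop optimizer, and 5 critic steps per generator step follow the original WGAN paper and are assumptions here):

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
C = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1))   # critic: no sigmoid
opt_G = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_C = torch.optim.RMSprop(C.parameters(), lr=5e-5)

real = torch.rand(64, 784) * 2 - 1               # placeholder batch of real images
for step in range(1000):
    for _ in range(5):                           # several critic updates per generator update
        fake = G(torch.randn(64, 100)).detach()
        c_loss = -(C(real).mean() - C(fake).mean())   # maximize the score gap (Wasserstein estimate)
        opt_C.zero_grad()
        c_loss.backward()
        opt_C.step()
        for p in C.parameters():                 # weight clipping keeps the critic well behaved
            p.data.clamp_(-0.01, 0.01)

    g_loss = -C(G(torch.randn(64, 100))).mean()  # generator: raise its score under the critic
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()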

WGAN training alternates in the same way: NN Generator v1 is paired with Discriminator v1, then NN Generator v2 with Discriminator v2, and NN Generator v3 with Discriminator v3.

Real poems (shown on the slide) compared with poems randomly generated by the WGAN.

So many GANs - just to name a few. Modifying the optimization of GAN: fGAN, WGAN, Least Squares GAN, Loss-Sensitive GAN, Energy-Based GAN, Boundary-Seeking GAN, Unrolled GAN. Different structure from the original GAN: Conditional GAN, Semi-supervised GAN, InfoGAN, BiGAN, CycleGAN, DiscoGAN, VAE-GAN.

Outline: Auto-encoder, Deep Learning, Deep Generative Model, Conditional Generation

Conditional Generation: we don't want to simply generate some random stuff; we want to generate output based on conditions. Caption generation: given an image as the condition, generate "A dog is running." Chat-bot: given "Hello" as the condition, respond

"Hello. Nice to see you."

Conditional Generation, e.g. from a sentence: "red hair" → NN Encoder → code → NN Generator → image?

Conditional Generation needs some supervision: sentences such as "red hair" and "green hair" are paired with images, and the NN Encoder and NN Generator are trained together so that the code produced from the sentence makes the generator output an image matching the condition.

Conditional Generation examples, conditioned on text such as "red hair, long hair", "black hair, blue eyes", and "blue hair, green eyes".
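A minimal sketch of a conditional generator, assuming PyTorch; the text encoder, token ids, and layer sizes are all illustrative assumptions: the condition is encoded into a vector, concatenated with a random code, and fed to the generator, which would be supervised with (text, image) pairs.

import torch
import torch.nn as nn

text_encoder = nn.EmbeddingBag(num_embeddings=5000, embedding_dim=64)   # toy text encoder
generator = nn.Sequential(nn.Linear(64 + 100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())

tokens = torch.tensor([[12, 305]])          # made-up token ids for a condition like "red hair"
cond = text_encoder(tokens)                 # condition vector, shape (1, 64)
z = torch.randn(1, 100)                     # random code
image = generator(torch.cat([cond, z], dim=1))   # image conditioned on the text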

Text to Text - Summarization (abstractive summary): the machine learns to do title generation from 2,000,000 training examples (document → title). [Yu & Lee, SLT 16]

Text to Text - Summarization: the input document (a long word sequence x1 x2 x3 ... xN) is encoded, and the summary (a short word sequence y1 y2 y3 y4 ...) is generated from it.

Text to Text - Summarization examples: Chinese documents with the human-written and machine-generated titles are shown on the slides.

Text to Text - Summarization demo.

Video to Text: given a video, generate a caption such as "A girl is running.", "A group of people is knocked by a tree.", or "A group of people is walking in the ...".

Video to Text: sequence-to-sequence learning. The video is encoded into a code, and a sentence generator produces the caption one word at a time ("a", "girl", ..., ending with "." (period)). Can the machine describe what it sees in a video?

Demo: MTK

Image to Text: represent the input condition (the image) as a vector with a CNN, and use that vector as the input of the RNN generator, which outputs the caption word by word ("A", "woman", ..., ending with "." (period)).

Image Caption Generation: given an input image, can the machine describe what it sees? Demo: MTK.
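A minimal sketch of image caption generation as described above, assuming PyTorch; the toy CNN, vocabulary size, and the choice of start/period token indices are all illustrative assumptions: a CNN turns the image into a vector, that vector initializes an RNN generator, and words are produced one at a time until "." (period).

import torch
import torch.nn as nn

vocab_size, hid = 1000, 256
cnn = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                    nn.Flatten(), nn.Linear(8, hid))       # toy CNN encoder
rnn = nn.GRUCell(vocab_size, hid)
to_word = nn.Linear(hid, vocab_size)

image = torch.rand(1, 3, 64, 64)            # placeholder input image
h = cnn(image)                              # the image as a vector = initial RNN state
word = torch.zeros(1, vocab_size)           # start-of-sentence token (index 0, by assumption)
word[0, 0] = 1.0
caption = []
for _ in range(10):                         # generate at most 10 words
    h = rnn(word, h)
    idx = to_word(h).argmax(dim=1)          # greedy choice of the next word
    caption.append(idx.item())
    if idx.item() == 1:                     # assume index 1 stands for "." (period)
        break
    word = nn.functional.one_hot(idx, vocab_size).float()
print(caption)                              # word indices (untrained network)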

(Demo image: http://news.ltn.com.tw/photo/politics/breakingnews/975542_1)

To Learn More
Machine Learning - slides: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML16.html; video: https://www.youtube.com/watch?v=fegAeph9UaA&list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49
Machine Learning and Having it Deep and Structured - slides: http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLDS17.html; video: https://www.youtube.com/watch?v=IzHoNwlCGnE&list=PLJV_el3uVTsPMxPbjeX7PicgWbY7F8wW9

Thank you for your attention!
