How to use a context window to segment a whole log Mel-spectrogram (ensuring the same number of segments for audios of different duration)
I have several audio files with different durations, so I don't know how to ensure the same number N of segments per audio. I'm trying to implement an existing paper, which says that first a log Mel-spectrogram is computed over the whole audio with 64 Mel filter banks from 20 to 8000 Hz, using a 25 ms Hamming window and a 10 ms shift between windows. To get that, I have the following code:
y, sr = librosa.load(audio_file, sr=None)
# sr = 22050
# len(y) = 237142
# duration = 5.377369614512472
n_mels = 64
win_length = int(np.ceil(0.025*sr))  # 25 ms window
n_fft = win_length                   # I'm not sure how to choose this parameter
hop_length = int(np.ceil(0.010*sr))  # 10 ms hop
window = 'hamming'
fmin = 20
fmax = 8000
S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False)
# melspectrogram expects a power spectrogram when S is given, not the raw complex STFT
M = np.log(librosa.feature.melspectrogram(sr=sr, S=np.abs(S)**2, n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)
# M.shape = (64, 532)
(I'm also not sure how to choose the n_fft parameter.)
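Regarding n_fft: the paper does not pin this down, but a common convention (an assumption here, not something the paper states) is to zero-pad the FFT to the next power of two at or above the window length, which speeds up the FFT without changing the analysis window:

```python
import numpy as np

sr = 22050                               # example rate from the question
win_length = int(np.ceil(0.025 * sr))    # 25 ms window -> 552 samples
# next power of two at or above the window length
n_fft = 2 ** int(np.ceil(np.log2(win_length)))
print(n_fft)  # 1024
```

Setting n_fft = win_length (no padding) is also valid; the choice mostly trades FFT speed against frequency-bin spacing.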
Then the paper says:
Use a context window of 64 frames to divide the whole log Mel-spectrogram into audio segments with size 64x64. A shift size of 30 frames is used during the segmentation, i.e. two adjacent segments are overlapped by 30 frames. Each divided segment hence has a length of 64 frames and its time duration is 10 ms x (64 - 1) + 25 ms = 655 ms.
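The quoted duration figure follows directly from the frame parameters; a quick sanity check of the formula duration = hop_ms * (n_frames - 1) + win_ms:

```python
# Segment duration per the paper's framing parameters:
hop_ms = 10      # 10 ms frame shift
win_ms = 25      # 25 ms analysis window
n_frames = 64    # frames per segment
duration_ms = hop_ms * (n_frames - 1) + win_ms
print(duration_ms)  # 655
```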
So I'm stuck on this last part: I don't know how to segment M into 64x64 pieces. And how can I get the same number of segments for all the audios (with different durations)? In the end I will need 64x64xN features as input to my neural network or classifier. I would appreciate any help! I'm a beginner in audio signal processing.
audio audio-processing spectrogram librosa windowing
Which paper are you implementing?
– jonnor
Jan 23 at 12:24
It's "Learning Affective Features With a Hybrid Deep Model for Audio–Visual Emotion Recognition"
– user2687945
Jan 23 at 18:07
asked Jan 18 at 20:22 by user2687945
1 Answer
Loop over the frames along the time axis, moving forward 30 frames at a time and extracting a window of the last 64 frames. At the start and end you need to either truncate or pad the data to get full windows.
import librosa
import numpy as np
import math

audio_file = librosa.util.example_audio_file()
y, sr = librosa.load(audio_file, sr=None, duration=5.0)  # only load 5 seconds

n_mels = 64
n_fft = int(np.ceil(0.025*sr))
win_length = int(np.ceil(0.025*sr))
hop_length = int(np.ceil(0.010*sr))
window = 'hamming'
fmin = 20
fmax = 8000

S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False)
# melspectrogram expects a power spectrogram when S is given
frames = np.log(librosa.feature.melspectrogram(sr=sr, S=np.abs(S)**2, n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)

window_size = 64
window_hop = 30

# truncate at start and end to only have windows with full data
# (the alternative would be to zero-pad)
start_frame = window_size
end_frame = window_hop * math.floor(float(frames.shape[1]) / window_hop)

for frame_idx in range(start_frame, end_frame, window_hop):
    window = frames[:, frame_idx-window_size:frame_idx]
    assert window.shape == (n_mels, window_size)
    print('classify window', frame_idx, window.shape)
This will output:
classify window 64 (64, 64)
classify window 94 (64, 64)
classify window 124 (64, 64)
...
classify window 454 (64, 64)
However, the number of windows will depend on the length of the audio sample. So if it is important to have the same number of windows for every file, you need to make sure all audio samples are the same length.
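One way to enforce the same number N of segments for every file (a sketch, not from the paper: the segment count and the zero-padding strategy are my assumptions) is to pad or truncate the spectrogram to exactly the frame count that N windows require before slicing:

```python
import numpy as np

def fixed_segments(frames, n_segments=8, window_size=64, window_hop=30):
    """Pad or truncate a (n_mels, T) spectrogram so it always yields
    exactly n_segments windows of window_size frames with hop window_hop."""
    needed = window_size + (n_segments - 1) * window_hop  # frames required
    n_mels, t = frames.shape
    if t < needed:
        # zero-pad short files at the end; for a log-mel spectrogram you
        # might instead pad with the floor value log(1e-6)
        frames = np.pad(frames, ((0, 0), (0, needed - t)))
    segments = [frames[:, i * window_hop : i * window_hop + window_size]
                for i in range(n_segments)]
    return np.stack(segments)  # shape (n_segments, n_mels, window_size)

segs = fixed_segments(np.zeros((64, 532)))  # e.g. the M from the question
print(segs.shape)  # (8, 64, 64)
```

Long files get their tail discarded and short files are padded, so every file produces the same 64x64xN stack for the classifier.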
Thank you so much for the clarification! However, how could I classify different segment sets with different feature lengths? I am now using all the segments to train a CNN model. Then, how could I fuse/join all of them for classification?
– user2687945
Jan 23 at 18:16
Do you have labels per segment or for the whole files?
– jonnor
2 days ago
I set labels for each segment during the training stage.
– user2687945
2 days ago
Do the different segments in the file have different labels, or do they all 'inherit' the label of the file? If the latter then you can use Multi Instance Learning to train all segments against the label of the file.
– jonnor
yesterday
Actually, all inherit the label of the file! Thank you for your suggestion! I will try to follow that approach!
– user2687945
yesterday
answered Jan 23 at 11:57, edited Jan 23 at 12:23 by jonnor