How to use a context window to segment a whole log Mel-spectrogram (ensuring the same number of segments for all audio files)?

I have several audio files with different durations, so I don't know how to ensure the same number N of segments for every audio. I'm trying to implement an existing paper, which says that first a log Mel-spectrogram is computed over the whole audio with 64 Mel filter banks from 20 to 8000 Hz, using a 25 ms Hamming window and a 10 ms shift. To obtain that, I have the following code:



import librosa
import numpy as np

y, sr = librosa.load(audio_file, sr=None)
# For this file: len(y) = 237142, duration = 5.377369614512472 s,
# which implies sr = 44100

n_mels = 64
n_fft = int(np.ceil(0.025*sr))      # I'm not sure how to set this parameter
win_length = int(np.ceil(0.025*sr)) # 25 ms window
hop_length = int(np.ceil(0.010*sr)) # 10 ms hop
window = 'hamming'

fmin = 20
fmax = 8000

S = librosa.core.stft(y, n_fft=n_fft, hop_length=hop_length,
                      win_length=win_length, window=window, center=False)
M = np.log(librosa.feature.melspectrogram(y=y, sr=sr, S=S, n_mels=n_mels,
                                          fmin=fmin, fmax=fmax) + 1e-6)

# M.shape = (64, 532)


(I'm also not sure how to set that n_fft parameter.)
The paper then says:




Use a context window of 64 frames to divide the whole log
Mel-spectrogram into audio segments with size 64x64. A shift size of
30 frames is used during the segmentation, i.e. two adjacent segments
are overlapped with 30 frames. Each divided segment hence has a length
of 64 frames and its time duration is 10 ms x (64-1) + 25 ms = 655 ms.




So I'm stuck on this last part: I don't know how to split M into 64x64 segments. And how can I get the same number of segments for all the audio files (with different durations)? In the end I will need 64x64xN features as input to my neural network or classifier. I would appreciate any help! I'm a beginner in audio signal processing.










audio audio-processing spectrogram librosa windowing






asked Jan 18 at 20:22 – user2687945 (83)

  • Which paper are you implementing?

    – jonnor, Jan 23 at 12:24

  • It's "Learning Affective Features With a Hybrid Deep Model for Audio–Visual Emotion Recognition".

    – user2687945, Jan 23 at 18:07

1 Answer

Loop over the frames along the time axis, moving forward 30 frames at a time and extracting a window of the last 64 frames. At the start and end you need to either truncate or pad the data to get full windows.



import librosa
import numpy as np
import math

audio_file = librosa.util.example_audio_file()
y, sr = librosa.load(audio_file, sr=None, duration=5.0) # only load 5 seconds

n_mels = 64
n_fft = int(np.ceil(0.025*sr))
win_length = int(np.ceil(0.025*sr))
hop_length = int(np.ceil(0.010*sr))
window = 'hamming'

fmin = 20
fmax = 8000

S = librosa.core.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False)
frames = np.log(librosa.feature.melspectrogram(y=y, sr=sr, S=S, n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)


window_size = 64
window_hop = 30

# truncate at start and end so that every window has full data
# (an alternative would be to zero-pad)
start_frame = window_size
end_frame = window_hop * math.floor(float(frames.shape[1]) / window_hop)

for frame_idx in range(start_frame, end_frame, window_hop):
    window = frames[:, frame_idx-window_size:frame_idx]
    assert window.shape == (n_mels, window_size)
    print('classify window', frame_idx, window.shape)


will output



classify window 64 (64, 64)
classify window 94 (64, 64)
classify window 124 (64, 64)
...
classify window 454 (64, 64)


However, the number of windows will depend on the length of the audio sample. So if it is important to have the same number of windows for every file, you need to make sure all audio samples are the same length, for example by padding or truncating them as sketched below.


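For instance, here is a minimal sketch (my addition, not part of the original answer) of one way to guarantee a fixed number N of segments per file: pad or truncate every waveform to the same fixed duration before computing the spectrogram, then collect the windows into one array. The function name extract_segments and the fixed_duration value are assumptions you would adapt to your dataset.

import math
import librosa
import numpy as np

def extract_segments(audio_file, fixed_duration=5.0, window_size=64, window_hop=30):
    # Returns an array of shape (N, 64, 64); N is identical for every file
    # because every waveform is first forced to the same length.
    y, sr = librosa.load(audio_file, sr=None)

    target_len = int(fixed_duration * sr)
    if len(y) < target_len:
        y = np.pad(y, (0, target_len - len(y)))  # zero-pad short files
    else:
        y = y[:target_len]                       # truncate long files

    n_fft = win_length = int(np.ceil(0.025 * sr))
    hop_length = int(np.ceil(0.010 * sr))
    S = librosa.core.stft(y, n_fft=n_fft, hop_length=hop_length,
                          win_length=win_length, window='hamming', center=False)
    frames = np.log(librosa.feature.melspectrogram(y=y, sr=sr, S=S, n_mels=64,
                                                   fmin=20, fmax=8000) + 1e-6)

    segments = []
    end_frame = window_hop * math.floor(float(frames.shape[1]) / window_hop)
    for frame_idx in range(window_size, end_frame, window_hop):
        segments.append(frames[:, frame_idx - window_size:frame_idx])
    return np.stack(segments)

With fixed_duration=5.0 at sr=44100 this yields N=14 segments for every file: files shorter than 5 s are zero-padded, longer ones are truncated.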




answered Jan 23 at 11:57 (edited Jan 23 at 12:23) – jonnor

  • Thank you so much for the clarification! However, how could I classify different segment sets with different feature lengths? I am now using all the segments to train a CNN model. Then, how could I fuse/join all of them for classification?

    – user2687945, Jan 23 at 18:16

  • Do you have labels per segment or for the whole files?

    – jonnor, 2 days ago

  • I set labels for each segment during the training stage.

    – user2687945, 2 days ago

  • Do the different segments in the file have different labels, or do they all 'inherit' the label of the file? If the latter then you can use Multi Instance Learning to train all segments against the label of the file (a sketch of this idea follows the thread).

    – jonnor, yesterday

  • Actually, all inherit the label of the file! Thank you for your suggestion! I will try to follow that approach!

    – user2687945, yesterday



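To make the multiple-instance suggestion above concrete, here is a minimal sketch (my illustration, not jonnor's code; predict_file and the shape of segment_probs are assumptions): train on segments that inherit the file label, then aggregate the per-segment class probabilities at inference, e.g. by averaging, into one file-level prediction.

import numpy as np

def predict_file(segment_probs):
    # segment_probs: array of shape (N, n_classes) with one probability
    # vector per 64x64 segment, e.g. from model.predict(segments)
    mean_probs = segment_probs.mean(axis=0)  # aggregate over the N segments
    return int(np.argmax(mean_probs))        # file-level class label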















