How to use a context window to segment a whole log Mel-spectrogram (ensuring the same number of segments for...
I have several audio files with different durations, so I don't know how to ensure the same number N of segments for every audio. I'm trying to implement an existing paper. It says that first a log Mel-spectrogram is computed over the whole audio, with 64 Mel filter banks from 20 to 8000 Hz, using a 25 ms Hamming window and a 10 ms shift between frames. To get that, I have the following code:



y, sr = librosa.load(audio_file, sr=None)
# sr = 22050, len(y) = 237142, duration = 5.377369614512472 s

n_mels = 64
n_fft = int(np.ceil(0.025 * sr))       # I'm not sure how to set this parameter
win_length = int(np.ceil(0.025 * sr))  # 25 ms at sr = 22050
hop_length = int(np.ceil(0.010 * sr))  # 10 ms at sr = 22050
window = 'hamming'

fmin = 20
fmax = 8000

S = librosa.core.stft(y, n_fft=n_fft, hop_length=hop_length,
                      win_length=win_length, window=window, center=False)
M = np.log(librosa.feature.melspectrogram(y=y, sr=sr, S=S, n_mels=n_mels,
                                          fmin=fmin, fmax=fmax) + 1e-6)

# M.shape = (64, 532)


(Also, I'm not sure how to set that n_fft parameter.)
Then the paper says:




Use a context window of 64 frames to divide the whole log
Mel-spectrogram into audio segments with size 64x64. A shift size of
30 frames is used during the segmentation, i.e. two adjacent segments
are overlapped with 30 frames. Each divided segment hence has a length
of 64 frames and its time duration is 10 ms x (64-1) + 25 ms = 655 ms.




So I'm stuck on this last part: I don't know how to perform the 64x64 segmentation of M, and I don't know how to get the same number of segments for all the audios (which have different durations), because in the end I will need 64x64xN features as input to my neural network or classifier. I will appreciate any help! I'm a beginner in audio signal processing.
  • Which paper are you implementing?

    – jonnor
    Jan 23 at 12:24











  • It's "Learning Affective Features With a Hybrid Deep Model for Audio–Visual Emotion Recognition"

    – user2687945
    Jan 23 at 18:07
Tags: audio, audio-processing, spectrogram, librosa, windowing
asked Jan 18 at 20:22
user2687945
1 Answer
Loop over the frames along the time axis, moving forward 30 frames at a time and extracting a window of the last 64 frames. At the start and end you need to either truncate or pad the data so that you only get full windows.



import librosa
import numpy as np
import math

audio_file = librosa.util.example_audio_file()
y, sr = librosa.load(audio_file, sr=None, duration=5.0)  # only load 5 seconds

n_mels = 64
n_fft = int(np.ceil(0.025 * sr))
win_length = int(np.ceil(0.025 * sr))
hop_length = int(np.ceil(0.010 * sr))
window = 'hamming'

fmin = 20
fmax = 8000

S = librosa.core.stft(y, n_fft=n_fft, hop_length=hop_length,
                      win_length=win_length, window=window, center=False)
# melspectrogram expects a power spectrogram when S is given, not the complex STFT
frames = np.log(librosa.feature.melspectrogram(y=y, sr=sr, S=np.abs(S)**2,
                                               n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)

window_size = 64
window_hop = 30

# truncate at start and end so that every window contains full data
# (the alternative would be to zero-pad)
start_frame = window_size
end_frame = window_hop * math.floor(float(frames.shape[1]) / window_hop)

for frame_idx in range(start_frame, end_frame, window_hop):
    window = frames[:, frame_idx - window_size:frame_idx]
    assert window.shape == (n_mels, window_size)
    print('classify window', frame_idx, window.shape)


This will output:



classify window 64 (64, 64)
classify window 94 (64, 64)
classify window 124 (64, 64)
...
classify window 454 (64, 64)
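The same segmentation can also be done without an explicit loop. This is a sketch using NumPy's `sliding_window_view` (NumPy >= 1.20); note that the window start positions differ slightly from the loop above (this version starts at frame 0), and the random array is only a stand-in for a real log Mel-spectrogram of the shape from the question:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# stand-in for a (n_mels, T) log Mel-spectrogram, e.g. the (64, 532) shape above
frames = np.random.randn(64, 532)
window_size, window_hop = 64, 30

# all length-64 windows along the time axis, then keep every 30th start position
windows = sliding_window_view(frames, window_size, axis=1)[:, ::window_hop, :]
# reorder axes from (n_mels, n_windows, window_size) to (n_windows, n_mels, window_size)
segments = windows.transpose(1, 0, 2)
print(segments.shape)  # (16, 64, 64)
```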


However, the number of windows will depend on the length of the audio sample. So if it is important to have the same number of windows for every file, you need to make sure all audio samples are the same length.
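To get exactly the same number N of segments for every file, one option is to fix N in advance and zero-pad short spectrograms (or truncate long ones) before segmenting. A minimal sketch of that idea; the function name and the choice of zero-padding are my own, not from the paper:

```python
import numpy as np

def segment_fixed_n(M, n_segments, window_size=64, window_hop=30):
    """Cut a (n_mels, T) spectrogram into exactly n_segments windows
    of shape (n_mels, window_size), zero-padding or truncating in time."""
    needed = window_size + (n_segments - 1) * window_hop  # frames required
    T = M.shape[1]
    if T < needed:
        M = np.pad(M, ((0, 0), (0, needed - T)), mode='constant')
    else:
        M = M[:, :needed]
    segments = [M[:, i * window_hop : i * window_hop + window_size]
                for i in range(n_segments)]
    return np.stack(segments)  # (n_segments, n_mels, window_size)

# works for both a long and a short spectrogram
print(segment_fixed_n(np.random.randn(64, 532), n_segments=16).shape)  # (16, 64, 64)
print(segment_fixed_n(np.random.randn(64, 100), n_segments=16).shape)  # (16, 64, 64)
```

For a log-spectrogram you may prefer to pad with its minimum value instead of zero, since zero is not "silence" on a log scale.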
  • Thank you so much for the clarification! However, how could I classify different segment sets, with different feature lengths? I am using now all the segments to train a CNN model. Then, how could I fuse/join all of them for classification?

    – user2687945
    Jan 23 at 18:16











  • Do you have labels per segment or for the whole files?

    – jonnor
    2 days ago













  • I set labels for each segment during the training stage.

    – user2687945
    2 days ago











  • Do the different segments in the file have different labels, or do they all 'inherit' the label of the file? If the latter then you can use Multi Instance Learning to train all segments against the label of the file.

    – jonnor
    yesterday











  • Actually, all inherit the label of the file! Thank you for your suggestion! I will try to follow that approach!

    – user2687945
    yesterday
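Since all segments inherit the label of the file, the simplest multiple-instance-style aggregation at prediction time is to average the per-segment predictions into one file-level prediction. A small illustration with made-up probabilities (the numbers are purely hypothetical):

```python
import numpy as np

# hypothetical per-segment class probabilities from a CNN: (n_segments, n_classes)
segment_probs = np.array([[0.7, 0.3],
                          [0.6, 0.4],
                          [0.4, 0.6]])

file_probs = segment_probs.mean(axis=0)   # mean-pool over segments
file_label = int(np.argmax(file_probs))
print(file_probs, file_label)  # [0.56666667 0.43333333] 0
```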
edited Jan 23 at 12:23
answered Jan 23 at 11:57
jonnor