Elasticsearch query_string wildcard does not consider length
I have some records in Elasticsearch that share the same first letters, such as: word, worda, wordab, wordabc, wordabcd.
I am using query_string with a wildcard:
"query": {
  "bool": {
    "must": [
      {
        "query_string": {
          "query": "word*"
        }
      }
    ]
  }
}
All hits have the same score ("_score" : 1.0), so their order is arbitrary. Is it possible to get a score that reflects how much of the term the query actually matches? For instance, word matches the term 100%, worda matches the term 80%, and so on.
elasticsearch
asked Jan 17 at 21:28
Mauricio Bertanha
1 Answer
Every matched document gets a score of 1 because wildcard and prefix queries are multi-term queries: to execute them, Elasticsearch first has to rewrite them into the actual terms that match.
There are several rewrite methods. The default one, constant_score, assigns the same constant score (1.0) to every match.
Some of the other rewrite methods produce differing scores, but those scores depend on the TF-IDF statistics of the matched terms (e.g. how often worda occurs in the matched document and how many documents in the whole index contain worda), not on how much of the term the pattern covers. As a starting point you could try top_terms_1000 and tune it later.
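A minimal sketch of how the rewrite method could be set on the query from the question, written in Python only to build and print the JSON request body. The helper name build_prefix_query is hypothetical and made up for this illustration; rewrite is a parameter of query_string, and top_terms_1000 is the value suggested above. The resulting scores would still reflect term statistics, not match length.

```python
import json


def build_prefix_query(prefix, rewrite="top_terms_1000"):
    """Build the request body from the question with an explicit rewrite method.

    `build_prefix_query` is a hypothetical helper for illustration; it only
    assembles a dict and does not talk to Elasticsearch.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "query_string": {
                            "query": prefix + "*",
                            # Replaces the default constant_score rewrite so
                            # that matched terms are scored individually.
                            "rewrite": rewrite,
                        }
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    # The printed JSON can be sent as the body of a _search request.
    print(json.dumps(build_prefix_query("word"), indent=2))
```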
Unfortunately, there is no out-of-the-box way to get exactly the behaviour you expect.
One way to approximate it is to use the Edge NGram tokenizer, so that wordabc is indexed as the following tokens:
w, wo, wor, word, ...
Querying against those tokens could produce a more meaningful score. For the exact outcome you describe (the percentage of the term that matches), you would need to create a custom query and scoring mechanism.
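To make the two ideas above concrete, here is a minimal Python sketch that runs outside Elasticsearch. edge_ngrams imitates the prefixes an Edge NGram tokenizer would emit (the min_gram and max_gram defaults used here are assumptions chosen for illustration), and match_ratio is a hypothetical scoring function for the "percent of the match" from the question, computed as prefix length divided by term length. It only illustrates the intended ranking; it is not an Elasticsearch API.

```python
def edge_ngrams(term, min_gram=1, max_gram=10):
    """Return the prefixes of `term` with lengths min_gram..max_gram."""
    longest = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, longest + 1)]


def match_ratio(prefix, term):
    """Fraction of `term` covered by `prefix`; 0.0 when it is not a prefix."""
    if not prefix or not term.startswith(prefix):
        return 0.0
    return len(prefix) / len(term)


def rank_by_match(prefix, terms):
    """Keep the terms that start with `prefix`, best match first."""
    scored = [(term, match_ratio(prefix, term)) for term in terms]
    matching = [pair for pair in scored if pair[1] > 0.0]
    return sorted(matching, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(edge_ngrams("wordabc"))
    for term, score in rank_by_match("word", ["wordabcd", "worda", "word", "other"]):
        print(term, round(score, 2))
```

With the records from the question, word gets a ratio of 1.0 and worda gets 0.8, which is the ordering the question asks for.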
answered 2 days ago
Mysterion