Elasticsearch query_string wildcard does not consider length













I have some records in Elasticsearch that share the same prefix, such as: word, worda, wordab, wordabc, wordabcd.



I am using query_string with a wildcard:



{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "word*"
          }
        }
      ]
    }
  }
}


All hits have the same score ("_score" : 1.0), so the order is arbitrary. Is it possible to get a score that reflects how much of the word the term actually matches? For instance, word matches the term 100%, worda matches the term 80%, and so on.











      elasticsearch







asked Jan 17 at 21:28 by Mauricio Bertanha


          1 Answer

You get a score of 1 for every matched document because wildcard and prefix queries are multi-term queries: before Elasticsearch can execute them, it has to rewrite them into the actual terms that match.



Several rewrite methods exist. The default one, constant_score, assigns the same constant score (1.0 by default) to every matching document.



Other rewrite methods produce differing scores, but those scores depend on the TF-IDF-style statistics of each expanded term (e.g. how often worda occurs in the matched document and how many documents in the whole index contain worda), not on how closely the term matches the prefix. As a starting point, you could set the rewrite parameter of query_string to top_terms_1000 and tune it later.
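
As a minimal sketch of the rewrite option, the Python snippet below builds the request body from the question with a rewrite setting added. It only constructs and prints the JSON and does not contact a cluster. The helper name build_wildcard_query is made up for illustration, and whether the resulting ranking is useful depends on the term statistics in your index.

```python
import json


def build_wildcard_query(prefix, rewrite="top_terms_1000"):
    """Return a search body for `prefix*` with a scoring rewrite method.

    `build_wildcard_query` is a hypothetical helper, not part of any
    Elasticsearch client. `rewrite` is passed through unchanged to the
    `rewrite` parameter of the query_string query.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "query_string": {
                            "query": f"{prefix}*",
                            "rewrite": rewrite,
                        }
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into a _search request.
    print(json.dumps(build_wildcard_query("word"), indent=2))
```

The printed body can be sent to the _search endpoint of the index; trying a different rewrite value only requires changing the second argument.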



Unfortunately, there is no perfect out-of-the-box way to get the behaviour you expect.



One way to approximate it is to use an Edge NGram tokenizer, which produces the following tokens from wordabc:



          w, wo, wor, word, ...
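
To make that token list concrete, here is a small Python sketch that mimics what an edge n-gram tokenizer emits for a single word, alongside an index settings body that configures one. The function edge_ngrams and the names prefix_tokenizer and prefix_analyzer are assumptions for illustration, and the min_gram and max_gram values are arbitrary choices that would need tuning.

```python
def edge_ngrams(word, min_gram=1, max_gram=10):
    """Return the prefixes of `word` with lengths min_gram..max_gram."""
    upper = min(len(word), max_gram)
    return [word[:n] for n in range(min_gram, upper + 1)]


# Index settings sketch; the tokenizer and analyzer names are hypothetical.
EDGE_NGRAM_SETTINGS = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "prefix_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "prefix_analyzer": {
                    "type": "custom",
                    "tokenizer": "prefix_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    }
}

if __name__ == "__main__":
    # Prints every prefix of the word, from "w" up to "wordabc".
    print(edge_ngrams("wordabc"))
```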


In this case, querying could produce a more meaningful score. For the exact outcome you expect (the percentage of the match), you would need to create a custom query and scoring mechanism.
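
One way to read the percentage from the question is the length of the query prefix divided by the length of the matched word. The sketch below computes that ratio on the client side, after the hits have been fetched. This is only one possible interpretation and not a built-in Elasticsearch feature; match_ratio and rank_by_match_ratio are hypothetical names.

```python
def match_ratio(prefix, word):
    """Fraction of `word` covered by `prefix`; 0.0 if it is not a prefix."""
    if not word or not word.startswith(prefix):
        return 0.0
    return len(prefix) / len(word)


def rank_by_match_ratio(prefix, words):
    """Return (word, ratio) pairs sorted from best to worst match."""
    scored = [(word, match_ratio(prefix, word)) for word in words]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    hits = ["wordabcd", "worda", "word", "wordab", "wordabc"]
    for word, ratio in rank_by_match_ratio("word", hits):
        print(f"{word}: {ratio:.0%}")
```

With the prefix word, this gives word a ratio of 1.0 and worda a ratio of 0.8, which corresponds to the 100% and 80% in the question.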







answered 2 days ago by Mysterion
