Elasticsearch Python: Search analyzer











I have a DataFrame like this:



df = pd.DataFrame({'FK':['QFf','TFzs'],'title':['World Fair 2018','ZUGER MESSE']})


I want Elasticsearch to find all documents whose title is similar to each title in my DataFrame and to calculate a score for each match. When matching titles, the year is not important: "World Fair 2018" and "World Fair 2017" should be treated as exactly the same. Below is part of my code, but it does not give me a useful matching score.



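import pandas as pd
from elasticsearch.helpers import scan

# es (the Elasticsearch client) and row (one row of df) come from the
# surrounding code, which is not shown in this excerpt.
query = {
    "query": {
        "bool": {
            "must": [
                {"query_string": {"default_field": "data.event.title",
                                  "query": row['title']}},
            ]}}}

# Keep only the first five hits returned for this title.
raw_results = []
for event in scan(es, query=query):
    if len(raw_results) < 5:
        raw_results.append(event)
    else:
        break

# Flatten each hit into one record: event fields plus id and score.
events_list = []
for item in raw_results:
    event_source = item['_source']['data']['event']
    event_source['event_id'] = item['_id']
    event_source['score'] = item['_score']
    events_list.append(event_source)

dframe2 = pd.DataFrame(events_list)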


The output I get looks like this:



FK    title            event_id  title               score
QFf   World Fair 2018  EfB       world fair          44.2
QFf   World Fair 2018  n77       world fair          44.1
QFf   World Fair 2018  5cY       world fair 2017     25.84
TFzs  ZUGER MESSE      NEQ       Styles Messe        20.12
TFzs  ZUGER MESSE      mBc       Hannover Messe      20.06
TFzs  ZUGER MESSE      Y9S       WunderWelten-Messe  20.0


As the output shows, the score of "world fair 2017" dropped because of the year difference. The score also seems to be affected by the length of the title. I need to run this code on about 6 million documents. How can I improve the code above?
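
To make the "year is not important" requirement concrete, here is a minimal sketch of one possible client-side approach, not a confirmed solution. It assumes a year is a standalone four-digit number between 1900 and 2099, and that the client's `search` method accepts a `body` argument. The helper names `strip_year`, `build_title_query` and `top_matches`, and whatever index name is passed in, are hypothetical. It uses a plain search with `size` instead of `scan`, since `scan` is meant for exporting whole result sets and by default sorts by `_doc` rather than by relevance.

```python
import re

# Assumption: a "year" is a standalone four-digit number from 1900 to 2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def strip_year(title):
    """Remove standalone four-digit years and collapse leftover whitespace."""
    return " ".join(YEAR_RE.sub(" ", title).split())


def build_title_query(title, field="data.event.title", size=5):
    """Build a search body with a `match` query on the year-free title.

    `match` analyzes the text as-is, so characters that act as operators
    in `query_string` syntax (for example "-" or ":") need no escaping.
    """
    return {
        "size": size,
        "query": {"match": {field: {"query": strip_year(title)}}},
    }


def top_matches(es, index, title, size=5):
    """Return (event_id, score, source) for the best-scoring hits.

    `es` is an Elasticsearch client and `index` a hypothetical index name.
    A regular search returns hits sorted by score, so no manual cut-off
    loop is needed to keep the first few results.
    """
    response = es.search(index=index, body=build_title_query(title, size=size))
    return [(hit["_id"], hit["_score"], hit["_source"])
            for hit in response["hits"]["hits"]]
```

This only removes the year from the query side; an indexed title such as "world fair 2017" still contains the extra token. The effect of title length on the score is a separate matter: in Elasticsearch it comes from field-length normalization, which is controlled by the `norms` mapping option on the field.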










Tags: elasticsearch, sentence-similarity






Asked yesterday by Narges; edited 8 hours ago.




























