Elasticsearch Python: Search analyzer

I have a DataFrame like this:

df = pd.DataFrame({'FK':['QFf','TFzs'],'title':['World Fair 2018','ZUGER MESSE']})

I want Elasticsearch to find all documents whose title is similar to a title in my DataFrame and to calculate a score for each match. When matching titles, the year is not important: "World Fair 2018" and "World Fair 2017" should be treated as exactly the same. Below is part of my code, but it does not give me a good matching score.

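import pandas as pd
from elasticsearch.helpers import scan

# `es` (the Elasticsearch client) and `row` (one row of df) are defined
# elsewhere; this excerpt handles a single row.
raw_results = []

try:
    for event in scan(es,
                      query={
                          "query": {
                              "bool": {
                                  "must": [
                                      {"query_string": {"default_field": "data.event.title",
                                                        "query": row['title']}},
                                  ]}},
                      }):  # any further arguments are omitted in this excerpt
        # Keep only the first five hits.
        if len(raw_results) < 5:
            raw_results.append(event)
        else:
            break

    events_list = []
    if len(raw_results) > 0:
        for item in raw_results:
            # Flatten each hit into one record with its id and score.
            event_id = item['_id']
            record_score = item['_score']
            event_source = item['_source']['data']['event']
            event_source['event_id'] = event_id
            event_source['score'] = record_score
            events_list.append(event_source)

        dframe2 = pd.DataFrame(events_list)
except Exception:
    # The except clause is not shown in the original excerpt; re-raise as a placeholder.
    raise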

The output that I get is similar to below:

 FK    title            event_id  matched title       score
 QFf   World Fair 2018  EfB       world fair          44.2
 QFf   World Fair 2018  n77       world fair          44.1
 QFf   World Fair 2018  5cY       world fair 2017     25.84
 TFzs  ZUGER MESSE      NEQ       Styles Messe        20.12
 TFzs  ZUGER MESSE      mBc       Hannover Messe      20.06
 TFzs  ZUGER MESSE      Y9S       WunderWelten-Messe  20.0

As the output above shows, the score of "world fair 2017" dropped because of the year difference. The score also seems to be affected by the length of the title. I need to run this code on about 6 million documents. I would appreciate any help on how to improve the code above.
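To make the "year is not important" requirement concrete, here is a minimal sketch of one possible direction, not a definitive fix. It strips standalone four-digit years from a title before querying and asks for only the five best-scoring hits with a `match` query. The helper names `strip_year`, `build_title_query` and `top_matches`, the `index` argument, and the assumption that years look like 19xx or 20xx are all hypothetical; `top_matches` further assumes a client whose `search` method accepts `index=` and `body=`, as the 7.x Python client does.

```python
import re

# Assumption: a "year" is a standalone four-digit number starting with 19 or 20.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def strip_year(title):
    """Lower-case a title, drop standalone years, and collapse whitespace."""
    without_year = YEAR_RE.sub(" ", title.lower())
    return " ".join(without_year.split())


def build_title_query(title, size=5):
    """Build a search body asking for the `size` best-scoring titles."""
    return {
        "size": size,
        "query": {
            "match": {
                # Field name taken from the question's query.
                "data.event.title": {"query": strip_year(title)}
            }
        },
    }


def top_matches(es, index, title, size=5):
    """Run the query and flatten each hit into one record.

    `es` is assumed to expose search(index=..., body=...) and to return
    the usual {"hits": {"hits": [...]}} response structure.
    """
    response = es.search(index=index, body=build_title_query(title, size))
    rows = []
    for hit in response["hits"]["hits"]:
        row = dict(hit["_source"]["data"]["event"])  # copy, do not mutate the hit
        row["event_id"] = hit["_id"]
        row["score"] = hit["_score"]
        rows.append(row)
    return rows
```

With this sketch, `strip_year("World Fair 2018")` and `strip_year("world fair 2017")` both give `"world fair"`, so the year no longer changes the query text. It only normalizes the query side: an indexed title such as "world fair 2017" still contains the extra token, which can still lower its score. Removing years at index time would need a change to the field's analyzer (for example a `pattern_replace` character filter), which is not shown here.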

Asked yesterday by Narges; edited 8 hours ago.

elasticsearch sentence-similarity
