Elasticsearch Python: Search analyzer
I have a DataFrame like this:
df = pd.DataFrame({'FK':['QFf','TFzs'],'title':['World Fair 2018','ZUGER MESSE']})
I want Elasticsearch to find all documents whose title is similar to each title in my DataFrame and to calculate a score for each match. When matching titles, the year is not important: "World Fair 2018" and "World Fair 2017" should be treated as exactly the same title. Below is part of my code, but it does not give me a useful matching score.
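import pandas as pd
from elasticsearch.helpers import scan

# `es` is an Elasticsearch client and `row` is one row of `df`;
# both are defined elsewhere in the full script.
raw_results = []
for event in scan(es,
                  query={
                      "query": {
                          "bool": {
                              "must": [
                                  {"query_string": {"default_field": "data.event.title",
                                                    "query": row['title']}},
                              ]}}}):
    # Keep only the first 5 hits for this title.
    if len(raw_results) < 5:
        raw_results.append(event)
    else:
        break

events_list = []
if len(raw_results) > 0:
    for item in raw_results:
        # Copy the document id and score into the event record.
        event_id = item['_id']
        record_score = item['_score']
        event_source = item['_source']['data']['event']
        event_source['event_id'] = event_id
        event_source['score'] = record_score
        events_list.append(event_source)
    dframe2 = pd.DataFrame(events_list)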
The output I get looks similar to this:
FK    title            event_id  title               score
QFf   World Fair 2018  EfB       world fair          44.2
QFf   World Fair 2018  n77       world fair          44.1
QFf   World Fair 2018  5cY       world fair 2017     25.84
TFzs  ZUGER MESSE      NEQ       Styles Messe        20.12
TFzs  ZUGER MESSE      mBc       Hannover Messe      20.06
TFzs  ZUGER MESSE      Y9S       WunderWelten-Messe  20.0
As the output shows, the score of "world fair 2017" dropped because of the year difference. The score also seems to be affected by the length of the title. I need to run this code on about 6 million documents. How can I improve the code above?
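To make the "year is not important" requirement concrete, here is a minimal sketch of one possible preprocessing step: stripping the year from a title before it is sent as the query text. It assumes that a year is always a standalone four-digit token between 1900 and 2099. The names `strip_year` and `build_title_query` are hypothetical helpers, not part of my existing code, and the `match` query is used here only for illustration.

```python
import re

# Assumption: a "year" is a standalone four-digit token from 1900 to 2099.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def strip_year(title):
    """Remove standalone four-digit years and collapse leftover whitespace."""
    without_year = YEAR_PATTERN.sub(" ", title)
    return " ".join(without_year.split())


def build_title_query(title, field="data.event.title"):
    """Build a match query body for the year-free title (hypothetical helper)."""
    return {"query": {"match": {field: strip_year(title)}}}


print(strip_year("World Fair 2018"))
print(build_title_query("ZUGER MESSE"))
```

This only normalizes the query side. The indexed titles would still contain the year, so "world fair 2017" keeps one more term than "world fair", and the two would probably still not score identically unless the same normalization is also applied when indexing.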
edited 8 hours ago
asked yesterday
Narges
134