What should a morphline for MapReduceIndexerTool look like?
I want to search through a lot of logs (about 1 TB in size, placed on multiple machines) efficiently.



For that purpose, I want to build an infrastructure composed of Flume, Hadoop, and Solr. Flume will collect the logs from several machines and put them into HDFS.



Now, I want to index those logs with a MapReduce job so that I can search through them using Solr. I found that MapReduceIndexerTool does this for me, but I see that it needs a morphline.



I know that a morphline, in general, performs a set of operations on the data it takes, but what kind of operations should I perform if I want to use MapReduceIndexerTool?



I can't find any example of a morphline adapted to this MapReduce job.



Thank you.

hadoop mapreduce morphline

asked Mar 5 at 12:35 by Cosmin Ioniță

  • Find link inside this section: flume.apache.org/FlumeUserGuide.html#morphlinesolrsink
    – cricket_007, Mar 5 at 13:44

  • I have added a reference to the Cloudera doc, which has a similar use-case example. Hope it helps.
    – gyan, Mar 5 at 16:22

2 Answers

1 upvote, accepted

Cloudera has a guide that covers an almost identical use case in its morphlines section.



[Figure: a Flume Source feeding syslog events through a Flume Morphline Sink (readLine → grok → loadSolr) into Solr]




In this figure, a Flume Source receives syslog events and sends them
to a Flume Morphline Sink, which converts each Flume event to a record
and pipes it into a readLine command. The readLine command extracts
the log line and pipes it into a grok command. The grok command uses
regular expression pattern matching to extract some substrings of the
line. It pipes the resulting structured record into the loadSolr
command. Finally, the loadSolr command loads the record into Solr,
typically a SolrCloud. In the process, raw data or semi-structured
data is transformed into structured data according to application
modelling requirements.
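
As an illustration of the pipeline the quote describes, here is a minimal morphline sketch, assuming the Kite SDK commands readLine, grok, and loadSolr; the SOLR_LOCATOR values, grok expression, and dictionary path below are placeholders rather than part of the original guide:

# A minimal sketch, not the Cloudera guide's exact config.
SOLR_LOCATOR : {
  collection : collection1               # placeholder collection name
  zkHost : "127.0.0.1:2181/solr"         # placeholder ZooKeeper ensemble
}
morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Extract the log line from the event body
      { readLine { charset : UTF-8 } }

      # Use regular-expression pattern matching to pull substrings
      # out of the line into named record fields
      {
        grok {
          dictionaryFiles : [/path/to/grok-dictionaries]   # hypothetical path
          expressions : {
            message : """%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{GREEDYDATA:msg}"""
          }
        }
      }

      # Load the resulting structured record into Solr (typically SolrCloud)
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]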




This example use case is what production tools such as MapReduceIndexerTool, Apache Flume Morphline Solr Sink, Apache Flume MorphlineInterceptor, and Morphline Lily HBase Indexer run as part of their operation, as outlined in the following figure:



[Figure: morphlines embedded in MapReduceIndexerTool, Flume Morphline Solr Sink, MorphlineInterceptor, and Lily HBase Indexer]






answered Mar 5 at 16:21 by gyan

  • This doesn't really answer the question "what kind of operations should I perform..." unless you are referring to the grok command.
    – cricket_007, Mar 5 at 23:04

  • @cricket_007 The link has the details on how to do that, including the sample code. I can't replicate the tutorial here, so I have put in an abstract and a possible algorithm. The OP still needs to go through the complete Cloudera guide.
    – gyan, Mar 6 at 7:05

  • Thank you for the answer. It is more or less correct and complete. The point is that I had already read the documentation you mentioned, though I have now spent a bit more time on it. However, I have one question: what is the actual purpose of the morphline? It just transforms the data into tokens so that it is easily indexable? Am I correct?
    – Cosmin Ioniță, Mar 6 at 8:48

  • A morphline in the big data world is similar to ETL in the classical world. The purpose of a morphline is to transform data from one state to another using a program/command. There has been development along these lines to make it config-driven and standardized, but I feel it is still emerging. Maybe someday we will have a morphline framework that supports plugging in your own transformations in a better way.
    – gyan, Mar 6 at 9:02

  • Okay, great, but I still need a clear answer to my question. Is the purpose of a morphline, in the case of MapReduceIndexerTool, to transform the data into an easily indexable format? What is its actual purpose when we want to index data using that MapReduce job?
    – Cosmin Ioniță, Mar 6 at 10:03

1 upvote

In general, in a morphline you only need to read your data, convert it to Solr documents, and then call loadSolr to create the index.



For example, this is the morphline file I used with MapReduceIndexerTool to upload Avro data into Solr:



SOLR_LOCATOR : {
  collection : collection1
  zkHost : "127.0.0.1:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**"]
    commands : [
      # Parse the Avro container file and emit one record per Avro object
      { readAvroContainer {} }

      # Map Avro fields to Solr document fields
      {
        extractAvroPaths {
          flatten : false
          paths : {
            id : /id
            field1_s : /field1
            field2_s : /field2
          }
        }
      }

      # Drop any fields not declared in the Solr schema
      {
        sanitizeUnknownSolrFields {
          solrLocator : ${SOLR_LOCATOR}
        }
      }

      # Load the record into Solr
      {
        loadSolr {
          solrLocator : ${SOLR_LOCATOR}
        }
      }
    ]
  }
]


When run, it reads the Avro container, maps Avro fields to Solr document fields, removes all other fields, and uses the provided Solr connection details to create the index. It's based on this tutorial.



This is the command I'm using to index files and merge them into the running collection:



sudo -u hdfs hadoop --config /etc/hadoop/conf \
  jar /usr/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
  --morphline-file /local/path/morphlines_file \
  --output-dir hdfs://localhost/mrit/out \
  --zk-host localhost:2181/solr \
  --collection collection1 \
  --go-live \
  hdfs:/mrit/in/my-avro-file.avro


Solr should be configured to work with HDFS, and the collection should already exist; a sketch of how the collection might be created is shown below.
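
Since the collection must exist before the job's go-live merge, here is a minimal sketch of creating it on CDH, assuming the solrctl tool that ships with Cloudera Search; the local config path and shard count are illustrative:

# Generate a local config template, then edit its schema.xml to declare
# the fields the morphline produces (id, field1_s, field2_s)
solrctl instancedir --generate $HOME/collection1_config

# Upload the config to ZooKeeper and create the collection with one shard
solrctl instancedir --create collection1 $HOME/collection1_config
solrctl collection --create collection1 -s 1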



All this setup works for me with Solr 4.10 on CDH 5.7 Hadoop.






answered Nov 21 at 20:54 by arghtype, edited Nov 21 at 21:13