## Part 1 - Basic Log Preparation


### Creating a Pipeline
* Launch the Data Collector console and create a new pipeline.

* Note: *If you'd like, feel free to download a previously created [pipeline](pipelines/Directory_to_ElasticSearch_Tutorial_Part_1.json) that has been configured with the contents of this tutorial. On the Data Collector Home page, select Import Pipeline to begin.*
<img style="width:80%;" src="img/import_pipeline.png">

#### Defining the Source
* Drag the Directory origin stage into your canvas.

* Go to the Configuration properties below and select the *Files* tab.

<img style="width:100%;" src="img/directory_config.png">

* Configure the following properties:

* **Data Format** - Log
* **Files Directory** - the absolute file path to the directory containing the sample .log.gz files
* **File Name Pattern** - `*.gz`
*This picks up all .gz files in the folder; you can use any wildcard to narrow down the selection.*
* **Files Compression** - Compressed File

* In the *Post Processing* tab, make sure **File Post Processing** is set to None.

*Note: This menu also lets you delete source files after they have been processed. You may want to use this in your production systems once you have verified your pipelines are configured correctly.*
<img style="width:100%;" src="img/directory_config_postproc.png">

* In the **Log** tab, set the **Log Format** option to Combined Log Format.

*Note: Data Collector already knows the Combined Log Format and a few other common log types, and has built-in regex patterns to decode them. If you are working with custom log formats, choose either Regular Expression or Grok Pattern from the menu and define your own format.*
<img style="width:100%;" src="img/directory_config_log.png">

#### Defining the 'geo' Field
* Drag and drop an Expression Evaluator processor into the canvas.

* In its Configuration properties, select the *Expressions* tab.

* Under Field Expressions, add an output field called **/geo** and set the field expression to `${emptyMap()}`.
*This creates a Map data structure to hold the value of the geo object that we will populate later.*

<img style="width:100%;" src="img/expression_eval.png">


#### Converting Fields
By default, Data Collector reads the fields in the log file as string values. This works for most fields; however, we know that web server logs contain numeric values for Response Code and Bytes Transferred, as well as a date/time stamp. Let's convert these fields to the right data types.

* Drag and drop a Field Converter processor into the pipeline.

* Go to the Configuration properties and select the Conversions tab.
<img style="width:100%;" src="img/field_converter.png">
* In the **Fields to Convert** text box, type `/bytes` and set **Convert to Type** to LONG.

* Click the `+` icon to add another conversion.

* In the new row, set **Fields to Convert** to `/response` and set **Convert to Type** to INTEGER.

* Click the `+` icon to add another conversion.

* In the new row, set **Fields to Convert** to `/timestamp` and set **Convert to Type** to DATETIME. Set **Date Format** to Other and in the **Other Date Format** text box type `dd/MMM/y:H:m:s Z`.
*You can use [Java DateTime format specifiers](https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html) to change the format to suit your needs.*
<img style="width:100%;" src="img/field_converter_timestamp.png">

#### Performing a GeoIP Lookup
* Download a copy of the MaxMind free [GeoIP2 Lite City Database](http://geolite.maxmind.com/download/geoip/database/GeoLite2-City.mmdb.gz). Move the downloaded file to the StreamSets Resources folder and unzip it there.

* Back in the Data Collector console, drag and drop the GeoIP processor into the canvas.

* In its Configuration properties, select the *Geolocations* tab.
<img style="width:100%;" src="img/geo_ip.png">
* In the GeoIP2 Database File text box, type `GeoLite2-City.mmdb`.

* Under Input Field Name, type `/clientip`, set its corresponding Output Field Name to `/city`, and set GeoIP2 Field to `CITY_NAME`.

* Hit `+` to add another field, set Input Field Name to `/clientip`, set Output Field Name to `/lat`, and set GeoIP2 Field to `LATITUDE`.

* Hit `+` again to add the last field, set Input Field Name to `/clientip`, set Output Field Name to `/lon`, and set GeoIP2 Field to `LONGITUDE`.
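Under the hood, each of these fields comes from a lookup against the MaxMind database for the record's `/clientip` value. As a rough sketch only (it assumes the `geoip2` Python package and uses an example IP address, neither of which is required for this tutorial), an equivalent lookup looks like this:

```python
import geoip2.database

# Look up city, latitude and longitude for one IP address using the same
# GeoLite2-City.mmdb file downloaded above. The IP address is an example.
reader = geoip2.database.Reader("GeoLite2-City.mmdb")
result = reader.city("128.101.101.101")

city = result.city.name           # what the GeoIP processor writes to /city
lat = result.location.latitude    # written to /lat
lon = result.location.longitude   # written to /lon
print(city, lat, lon)
reader.close()
```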

#### Setting up Elasticsearch

* Finally, to configure a destination, drag and drop an Elasticsearch destination onto the canvas.

* In its Configuration properties, select the General tab. In the menu for Stage Library, select the version of Elasticsearch you are running.

<img style="width:100%;" src="img/elastic_config.png">

* Go to the Elasticsearch tab, and in the Cluster Name text box enter the name of your cluster as specified in `elasticsearch.yml`.

* In the Cluster URI field, specify the `host:port` where your Elasticsearch service is running.

* In the Index and Mapping text boxes, enter `logs`. This is the index and mapping that we set up earlier in this tutorial (a sample index-creation request is shown at the end of this section).

* Finally, before we do anything with the pipeline, click on any blank spot on the canvas. In the pipeline configuration properties, click the Error Records tab. For the Error Records property, select 'Discard (Library:Basic)'. This effectively tells the pipeline to discard any erroneous data.
In a real production system, you can choose to send error records to a number of different systems.
<img style="width:100%;" src="img/discard_errors.png">
### Preview the Pipeline
After you set up the pipeline, you can hit the Preview icon to examine the flow of a small subset of the data.

Preview mode lets you interactively debug your stage configurations.

#### Let's Ship Some Logs
* Once the pipeline has been set up, hit the Start icon to execute the pipeline.

* At this point, Data Collector should start reading from the origin directory and sending data into Elasticsearch.
<img style="width:100%;" src="img/running_pipeline.png">

* You can fire up a [Kibana Dashboard](kibana/ApacheWebLog.json) to view the results of the import into Elasticsearch.
<img style="width:100%;" src="img/part1_kibana_dashboard.png">
*Notice that the Browser Type graph doesn't show up in Kibana; we'll take care of that in Part 2.*

## What Next?
* In [Part 2](log_shipping_to_elasticsearch_part2.md) of this tutorial, we will see how to write custom Python code to enhance our log data. We will also set up metric alerts as we prepare the pipeline for production use.