# elasticsearch

Introduction to Elasticsearch

## Technology

[Elasticsearch](https://www.elastic.co/)

## Objective

Create an ELK cluster with [Docker Compose](https://docs.docker.com/compose/). Play with Elasticsearch, Kibana, and Logstash.

## References

- [Set up Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html)
- [Beginner's Crash Course to Elastic Stack - Part 1: Intro to Elasticsearch and Kibana](https://www.youtube.com/watch?v=gS_nHTWZEJ8)
- [Table of Contents: Beginner's Crash Course to Elastic Stack Series](https://github.com/LisaHJung/Beginners-Crash-Course-to-Elastic-Stack-Series-Table-of-Contents)
- [Beginner's guide to building a full stack JavaScript web app with Elasticsearch](https://github.com/LisaHJung/beginners-guide-to-creating-a-full-stack-Javascript-app-with-Elasticsearch)
- [Elasticsearch 8 and the Elastic Stack: In Depth and Hands On](https://www.udemy.com/course/elasticsearch-7-and-elastic-stack)

## Dependencies

- [Introduction to Docker](https://github.com/AlexPaar/docker-intro)
- [News Category Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset)
- [E-Commerce Data](https://www.kaggle.com/datasets/carrie1/ecommerce-data)

## Overview

Elasticsearch...

- Started off as scalable Lucene
- Horizontally scalable search engine
- Each _shard_ is an inverted index / set of doc values
- Not only for text, but also useful for structured data; in some cases a fast substitute for Hadoop or Spark

Kibana...

- Web UI for searching and visualization
- Complex aggregations, graphs, charts
- Often used for log analysis

Logstash / Beats...

- Ways to feed data into Elasticsearch
- FileBeat can monitor log files, parse them, and import them into Elasticsearch in near-real-time
- Logstash also pushes data into Elasticsearch from many sources

X-Pack...

- Paid add-on
- Security, alerting, monitoring, reporting, machine learning, graph exploration

What's new in Elasticsearch 8?

- The concept of document types is gone for good
- Data streams are now mature, based on index lifecycle management
- Security is enabled by default and tighter (an enrollment token is needed)
- NLP via imported PyTorch models (inference at ingest)
- _Serverless log ingestion_ from AWS to Elastic Cloud
- Elastic Agents for Azure and Cassandra
- Vector similarity / kNN search (experimental), computes document similarity
- Machine learning (experimental), e.g. anomaly detection to spot server failures
- New Canvas editor
- Maps / vector tile support
- New Kibana UI
- Enterprise Search (integrate Elasticsearch with mobile apps, OneDrive)
- V7 compatibility mode

## Development

### Create and Configure an ELK Cluster with Docker Compose

Create the **docker-compose.yml** file. Start up the cluster.

```bash
docker-compose up
```

On node **elasticsearch**, check the _/usr/share/elasticsearch/config/elasticsearch.yml_ configuration file (attach with a [Dev Container](https://code.visualstudio.com/docs/devcontainers/attach-container)).

The Elastic API follows the syntax **GET \_API/parameter**. There are various [REST APIs](https://www.elastic.co/guide/en/elasticsearch/reference/current/rest-apis.html) available. Note the [API conventions](https://www.elastic.co/guide/en/elasticsearch/reference/current/api-conventions.html) and the [common options](https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html).

Get info about the cluster health.
```bash
GET _cluster/health
```

Get info about nodes in a cluster.

```bash
GET _nodes/stats
```

### Perform CRUD Operations

Create an index with **PUT Name-of-the-Index**.

#### C - Create

Create documents with the [Index API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html).

POST a document and let Elasticsearch create an id.

```bash
POST name-of-the-index/_doc
{
  "field": "value"
}
```

PUT a document and provide an id.

```bash
PUT name-of-the-index/_doc/id
{
  "field": "value"
}
```

When you index a document using an id that already exists, the existing document is overwritten by the new document. If you do not want an existing document to be overwritten, you can use the **\_create** endpoint! With the **\_create** endpoint, if a document with the given id already exists, no indexing will occur and you will get a 409 error message. To update an existing document, you must use the **\_doc** resource.

```bash
PUT name-of-the-index/_create/id
{
  "field": "value"
}
```

#### R - Read

Syntax:

```bash
GET name-of-the-index/_doc/id
```

#### U - Update

Syntax:

```bash
POST name-of-the-index/_update/id
{
  "doc": {
    "field1": "value",
    "field2": "value"
  }
}
```

#### D - Delete

Syntax:

```bash
DELETE name-of-the-index/_doc/id
```

#### Samples

```bash
POST dhsh/_doc
{
  "name": "Alice",
  "major": "BWL"
}

GET dhsh/_search

PUT dhsh/_doc/23
{
  "name": "Bob",
  "major": "BWL"
}

POST dhsh/_update/23
{
  "doc": {
    "name": "Bob",
    "major": "WINF"
  }
}

GET dhsh/_doc/23

PUT dhsh/_doc/23
{
  "name": "Bob",
  "major": "WINF"
}

DELETE dhsh/_doc/23

DELETE dhsh
```

### Relevance of a Search

Explain _precision_ and _recall_. Precision and recall determine which documents are included in the search results. They do not determine which of the included documents are more relevant than the others. _Ranking_ refers to ordering the results based on their _score_.

Download the [News Category Dataset](https://www.kaggle.com/datasets/rmisra/news-category-dataset) and in Kibana upload it into an Elasticsearch _news_ index. Check out the index data and mapping.

```bash
GET news/_search

GET news/_search
{
  "track_total_hits": true
}

GET news/_mapping
```

Search for data within a specific time range.

```bash
GET news/_search
{
  "query": {
    "range": {
      "date": {
        "gte": "2015-06-20",
        "lte": "2015-09-22"
      }
    }
  }
}
```

The articles fall into different categories. Thus, analyze the data to show the categories of news headlines. Use an aggregation that summarizes your data as metrics, statistics, and other analytics.

```bash
GET news/_search
{
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category",
        "size": 100
      }
    }
  }
}
```

In the result, the **aggregations** field contains up to 100 categories found in the index documents.

Next, search for the most significant terms in the _ENTERTAINMENT_ category. Check out the _popular_in_entertainment_ aggregation report.

```bash
GET news/_search
{
  "query": {
    "match": {
      "category": "ENTERTAINMENT"
    }
  },
  "aggregations": {
    "popular_in_entertainment": {
      "significant_text": {
        "field": "headline"
      }
    }
  }
}
```

Run a match query with ORed search terms. This gives us high recall.

```bash
GET news/_search
{
  "query": {
    "match": {
      "headline": {
        "query": "Khloe Kardashian Kendall Jenner"
      }
    }
  }
}
```

If a document contains one of the search terms, Elasticsearch will consider that document a hit. OR logic results in a higher number of hits, thereby increasing recall. However, the hits are only loosely related to the query, lowering precision as a result.
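For reference, **or** is the default operator of a **match** query; the request above is equivalent to the following sketch that spells the operator out explicitly (same _news_ index assumed).

```bash
GET news/_search
{
  "query": {
    "match": {
      "headline": {
        "query": "Khloe Kardashian Kendall Jenner",
        "operator": "or"
      }
    }
  }
}
```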
We can increase precision by adding an AND operator to the query. The AND operator results in more precise matches, thereby increasing precision. However, it reduces the number of hits returned, resulting in lower recall.

```bash
GET news/_search
{
  "query": {
    "match": {
      "headline": {
        "query": "Khloe Kardashian Kendall Jenner",
        "operator": "and"
      }
    }
  }
}
```

The **minimum_should_match** parameter allows you to specify the minimum number of terms a document must match to be included in the search results. This parameter gives you more control over fine-tuning precision and recall of your search.

```bash
GET news/_search
{
  "query": {
    "match": {
      "headline": {
        "query": "Khloe Kardashian Kendall Jenner",
        "minimum_should_match": 3
      }
    }
  }
}
```

### Running Full Text Queries and Combined Queries

With a **match** query, the order of the search terms is not significant. If the order of the search terms is significant, use a **match_phrase** query. The overall syntax is identical. Pull up articles about the Ed Sheeran song "Shape of You" as follows.

```bash
GET news/_search
{
  "query": {
    "match_phrase": {
      "headline": {
        "query": "Shape of You"
      }
    }
  }
}
```

#### Running a match query against multiple fields

When designing a query, you don't always know the context of a user's search. When a user searches for "Donald Trump", the user could be searching for statements written by Donald Trump or articles written about him. To accommodate these contexts, you can write a multi_match query, which searches for terms in multiple fields. The multi_match query runs a match query on multiple fields and calculates a score for each field. Then, it assigns the highest score among the fields to the document. This score determines the ranking of the document within the search results.

```bash
GET news/_search
{
  "query": {
    "multi_match": {
      "query": "Donald Trump",
      "fields": [
        "headline",
        "short_description",
        "authors"
      ]
    }
  }
}
```

#### Per-field boosting

Documents mentioning "Cristiano Ronaldo" in the field _headline_ are more likely to be related to our search than documents that mention "Cristiano Ronaldo" only once or twice in the field _short_description_. To improve the precision of your search, you can designate one field to carry more weight than the others. This can be done by boosting the score of the field _headline_ (_per-field boosting_). This is notated by appending a caret (**^**) and the boost factor 2 to the desired field as shown below.

Per-field boosting yields the same number of hits. However, it changes the ranking of the hits. The hits ranked higher on the list contain the search terms "Cristiano Ronaldo" in the boosted field, _headline_.

```bash
GET news/_search
{
  "query": {
    "multi_match": {
      "query": "Cristiano Ronaldo",
      "fields": [
        "headline^2",
        "short_description",
        "authors"
      ]
    }
  }
}
```

#### Improving precision with phrase type match

You can improve the precision of a **multi_match** query by adding **"type": "phrase"** to the query. The phrase type performs a **match_phrase** query on each field and calculates a score for each field. Then, it assigns the highest score among the fields to the document.

```bash
GET news/_search
{
  "query": {
    "multi_match": {
      "query": "Cristiano Ronaldo",
      "fields": [
        "headline^2",
        "short_description",
        "authors"
      ],
      "type": "phrase"
    }
  }
}
```

#### Combined queries

The [bool query](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/query-dsl-bool-query.html)
retrieves documents matching boolean combinations of other queries. With the _bool query_, you can combine multiple queries into one request and further specify boolean clauses to narrow down your search results. A bool query can help you answer multi-faceted questions.

There are four clauses to choose from:

- **must**
- **must_not**
- **should**
- **filter**

The **must** clause defines all queries (criteria) a document MUST match to be returned as a hit. These criteria are expressed in the form of one or multiple queries. All queries in the **must** clause must be satisfied for a document to be returned as a hit. As a result, having more queries in the **must** clause will increase the precision of your query.

The **must_not** clause defines queries (criteria) a document MUST NOT match to be included in the search results.

The **should** clause adds "nice to have" queries (criteria). The documents do not need to match the "nice to have" queries to be considered as hits. However, the ones that do are given a higher score, so they show up higher in the search results.

The **filter** clause contains filter queries that place documents into either a "yes" or a "no" category. For example, let's say you are looking for headlines published within a certain time range. Some documents fall within this range (yes), others do not (no). The filter clause only includes documents that fall into the yes category. The **filter** clause does not contribute to the score. Use filters when you can - they are faster and cacheable.

```bash
GET news/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "headline": "Angela Merkel"
          }
        },
        {
          "match": {
            "headline": "Putin"
          }
        }
      ],
      "should": [
        {
          "match": {
            "headline": "Chancellor"
          }
        }
      ],
      "filter": {
        "range": {
          "date": {
            "gte": "2014-01-01",
            "lte": "2021-09-21"
          }
        }
      }
    }
  }
}
```

### Running Aggregations with Elasticsearch and Kibana

Download the [E-Commerce Data](https://www.kaggle.com/datasets/carrie1/ecommerce-data) and in Kibana upload it into an Elasticsearch _ecommerce_tmp_ index.

Oftentimes, the dataset is not optimal for running requests in its original state. For example, the type of a field may not be recognized by Elasticsearch, or the dataset may contain a value that was accidentally included in the wrong field, etc. These are exactly the problems that I ran into while working with this dataset. The following are the requests that I sent to yield the results shared during the workshop. Copy and paste these requests into the Kibana console (Dev Tools) and run them in the order shown below.

```bash
PUT ecommerce
{
  "mappings": {
    "properties": {
      "Country": { "type": "keyword" },
      "CustomerID": { "type": "long" },
      "Description": { "type": "text" },
      "InvoiceDate": { "type": "date", "format": "M/d/yyyy H:m" },
      "InvoiceNo": { "type": "keyword" },
      "Quantity": { "type": "long" },
      "StockCode": { "type": "keyword" },
      "UnitPrice": { "type": "double" }
    }
  }
}

POST _reindex
{
  "source": {
    "index": "ecommerce_tmp"
  },
  "dest": {
    "index": "ecommerce"
  }
}

DELETE ecommerce_tmp
```

Check out the index data and mapping.

```bash
GET ecommerce/_search

GET ecommerce/_search
{
  "track_total_hits": true
}

GET ecommerce/_mapping
```

#### Remove the negative values from the field _UnitPrice_

When you explore the minimum unit price in this dataset, you will see that the minimum unit price value is -11062.06. To keep our data simple, I used the **delete_by_query** API to remove all unit prices less than or equal to 0.
```bash
POST ecommerce/_delete_by_query
{
  "query": {
    "range": {
      "UnitPrice": {
        "lte": 0
      }
    }
  }
}
```

#### Remove values greater than 500 from the field _UnitPrice_

When you explore the maximum unit price in this dataset, you will see that the maximum unit price value is 38,970. When the data is manually examined, the majority of the unit prices are less than 500. The max value of 38,970 would skew the average. To simplify our demo, I used the **delete_by_query** API to remove all unit prices greater than or equal to 500.

```bash
POST ecommerce/_delete_by_query
{
  "query": {
    "range": {
      "UnitPrice": {
        "gte": 500
      }
    }
  }
}
```

#### Metric Aggregations

Metric aggregations are used to compute numeric values based on your dataset. They can be used to calculate, for example, sum, min, max, avg, and unique count (cardinality) values. In general, the syntax of aggregation requests is as follows.

```bash
GET Enter_name_of_the_index_here/_search
{
  "aggs": {
    "Name your aggregations here": {
      "Specify the aggregation type here": {
        "field": "Name the field you want to aggregate on here"
      }
    }
  }
}
```

Compute the **sum** of all unit prices in the index as follows. By default, Elasticsearch returns the top 10 hits. When you minimize **hits**, you will see the aggregations result **sum_unit_price**. It displays the sum of all unit prices present in our index.

```bash
GET ecommerce/_search
{
  "aggs": {
    "sum_unit_price": {
      "sum": {
        "field": "UnitPrice"
      }
    }
  }
}
```

If the purpose of running an aggregation is solely to get the aggregations results, you can add a **size** parameter and set it to 0 as shown below. This parameter prevents Elasticsearch from fetching the top 10 hits so that the aggregations results are shown at the top of the response.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "sum_unit_price": {
      "sum": {
        "field": "UnitPrice"
      }
    }
  }
}
```

Compute the lowest (**min**) unit price of an item.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "lowest_unit_price": {
      "min": {
        "field": "UnitPrice"
      }
    }
  }
}
```

Compute the highest (**max**) unit price of an item.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "highest_unit_price": {
      "max": {
        "field": "UnitPrice"
      }
    }
  }
}
```

Compute the average (**avg**) unit price of an item.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "average_unit_price": {
      "avg": {
        "field": "UnitPrice"
      }
    }
  }
}
```

Compute the **count**, **min**, **max**, **avg**, and **sum** in one go with the **stats** aggregation.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "all_stats_unit_price": {
      "stats": {
        "field": "UnitPrice"
      }
    }
  }
}
```

The **cardinality** aggregation computes the count of unique values for a given field.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "unique_customers_count": {
      "cardinality": {
        "field": "CustomerID"
      }
    }
  }
}
```

In the previous examples, aggregations were performed on all documents in the _ecommerce_ index. What if you want to run an aggregation on a subset of the documents? For example, our index contains e-commerce data from multiple countries. Let's say you want to calculate the average unit price of items sold in Germany. To limit the scope of the aggregation, you can add a query clause to the aggregations request. The query clause defines the subset of documents that aggregations should be performed on. The combined query and aggregations look like the following.
```bash
GET ecommerce/_search
{
  "size": 0,
  "query": {
    "match": {
      "Country": "Germany"
    }
  },
  "aggs": {
    "germany_average_unit_price": {
      "avg": {
        "field": "UnitPrice"
      }
    }
  }
}
```

#### Bucket Aggregations

When you want to aggregate on several subsets of documents, bucket aggregations will come in handy. Bucket aggregations group documents into several sets of documents called buckets. All documents in a bucket share common criteria. The following are different types of bucket aggregations.

- Date Histogram Aggregation
- Histogram Aggregation
- Range Aggregation
- Terms Aggregation

##### Date Histogram Aggregation

When you are looking to group data by time interval, the **date_histogram** aggregation will prove very useful. Our _ecommerce_ index contains transaction data that has been collected over time (from the year 2010 to 2011). If we are looking to get insights about transactions over time, our first instinct should be to run the **date_histogram** aggregation.

There are two ways to define a time interval with the **date_histogram** aggregation: **fixed_interval** and **calendar_interval**.

With **fixed_interval**, the interval is always constant. Create, for example, a bucket for every 8 hour interval.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "transactions_by_8_hrs": {
      "date_histogram": {
        "field": "InvoiceDate",
        "fixed_interval": "8h"
      }
    }
  }
}
```

With **calendar_interval**, the interval may vary. For example, we could choose a time interval of day, month, or year. But daylight saving time can change the length of specific days, months can have different numbers of days, and leap seconds can be tacked onto a particular year. So the length of a day, a month, or a year can vary. A scenario where you might use **calendar_interval** is when you want to calculate the monthly revenue.

Elasticsearch creates monthly buckets. Within each bucket, the date (monthly interval) is included in the field **key_as_string**. The field **key** shows the same date represented as a timestamp. The field **doc_count** shows the number of documents that fall within the time interval.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "transactions_by_month": {
      "date_histogram": {
        "field": "InvoiceDate",
        "calendar_interval": "1M"
      }
    }
  }
}
```

By default, the date_histogram aggregation sorts buckets based on the **key** values in ascending order. To reverse this order, you can add an order parameter to the aggregation as shown below. Then, specify that you want to sort buckets based on the **\_key** values in descending (**desc**) order.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "transactions_by_month": {
      "date_histogram": {
        "field": "InvoiceDate",
        "calendar_interval": "1M",
        "order": {
          "_key": "desc"
        }
      }
    }
  }
}
```

The following snippet calculates the sum over all monthly sales buckets. **buckets_path** instructs the **sum_bucket** aggregation that we want the sum of the **sales** aggregation in the **transactions_by_month** date histogram.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "transactions_by_month": {
      "date_histogram": {
        "field": "InvoiceDate",
        "calendar_interval": "1M"
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "UnitPrice"
          }
        }
      }
    },
    "sum_monthly_transactions": {
      "sum_bucket": {
        "buckets_path": "transactions_by_month>sales"
      }
    }
  }
}
```

##### Histogram Aggregation

With the **date_histogram** aggregation, we were able to create buckets based on time intervals. The **histogram** aggregation creates buckets based on any numerical interval.
Create, for example, buckets based on a price interval that increases by increments of 10.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "transactions_per_price_interval": {
      "histogram": {
        "field": "UnitPrice",
        "interval": 10
      }
    }
  }
}
```

##### Range Aggregation

The range aggregation is similar to the histogram aggregation in that it can create buckets based on any numerical interval. The difference is that the range aggregation allows you to define intervals of varying sizes so you can customize it to your use case. For example, what if you wanted to know the number of transactions for items from varying price ranges (between $0 and $50, between $50 and $200, and $200 and up)?

The range aggregation is sorted based on the input ranges you specify and it cannot be sorted any other way!

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "transactions_per_custom_price_ranges": {
      "range": {
        "field": "UnitPrice",
        "ranges": [
          {
            "to": 50
          },
          {
            "from": 50,
            "to": 200
          },
          {
            "from": 200
          }
        ]
      }
    }
  }
}
```

##### Terms Aggregation

The **terms** aggregation creates a new bucket for every unique term it encounters for the specified field. It is often used to find the most frequently occurring terms in an index. For example, let's say you want to identify the 5 customers with the highest number of transactions (i.e. documents).

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "top_5_customers": {
      "terms": {
        "field": "CustomerID",
        "size": 5
      }
    }
  }
}
```

By default, the **terms** aggregation sorts buckets based on the **doc_count** values in descending order. To reverse this order, you can add an order parameter to the aggregation. Then, specify that you want to sort buckets based on the **\_count** values in ascending (**asc**) order.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "5_customers_with_lowest_number_of_transactions": {
      "terms": {
        "field": "CustomerID",
        "size": 5,
        "order": {
          "_count": "asc"
        }
      }
    }
  }
}
```

#### Combined Aggregations

So far, we have run metric aggregations or bucket aggregations to answer simple questions. There will be times when we will ask more complex questions that require running combinations of these aggregations. For example, let's say we wanted to know the sum of revenue per day. To get the answer, we need to first split our data into daily buckets (**date_histogram** aggregation). Within each bucket, we need to perform a metric aggregation to calculate the daily revenue.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "transactions_per_day": {
      "date_histogram": {
        "field": "InvoiceDate",
        "calendar_interval": "day"
      },
      "aggs": {
        "daily_revenue": {
          "sum": {
            "script": {
              "source": "doc['UnitPrice'].value * doc['Quantity'].value"
            }
          }
        }
      }
    }
  }
}
```

You can also calculate multiple metrics per bucket. For example, let's say you wanted to calculate the daily revenue and the number of unique customers per day in one go. To do this, you can add multiple metric aggregations per bucket as shown below.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "transactions_per_day": {
      "date_histogram": {
        "field": "InvoiceDate",
        "calendar_interval": "day"
      },
      "aggs": {
        "daily_revenue": {
          "sum": {
            "script": {
              "source": "doc['UnitPrice'].value * doc['Quantity'].value"
            }
          }
        },
        "number_of_unique_customers_per_day": {
          "cardinality": {
            "field": "CustomerID"
          }
        }
      }
    }
  }
}
```

You do not always need to sort by time interval, numerical interval, or **doc_count**. You can also sort by the metric value of a sub-aggregation. Let's take a look at the request below.
Within the sub-aggregation, the metric values **daily_revenue** and **number_of_unique_customers_per_day** are calculated. Let's say you wanted to find which day had the highest daily revenue to date. All you need to do is add the **order** parameter and sort buckets based on the metric value of **daily_revenue** in descending (**desc**) order.

```bash
GET ecommerce/_search
{
  "size": 0,
  "aggs": {
    "transactions_per_day": {
      "date_histogram": {
        "field": "InvoiceDate",
        "calendar_interval": "day",
        "order": {
          "daily_revenue": "desc"
        }
      },
      "aggs": {
        "daily_revenue": {
          "sum": {
            "script": {
              "source": "doc['UnitPrice'].value * doc['Quantity'].value"
            }
          }
        },
        "number_of_unique_customers_per_day": {
          "cardinality": {
            "field": "CustomerID"
          }
        }
      }
    }
  }
}
```

### Mapping with Elasticsearch and Kibana

Mapping determines how a document and its fields are indexed and stored by defining the type of each field. It contains a list of the names and types of fields in an index. Depending on its type, each field is indexed and stored differently in Elasticsearch.

```bash
POST produce/_doc
{
  "name": "Pineapple",
  "botanical_name": "Ananas comosus",
  "produce_type": "Fruit",
  "country_of_origin": "New Zealand",
  "date_purchased": "2020-06-02T12:15:35",
  "quantity": 200,
  "unit_price": 3.11,
  "description": "a large juicy tropical fruit consisting of aromatic edible yellow flesh surrounded by a tough segmented skin and topped with a tuft of stiff leaves. These pineapples are sourced from New Zealand.",
  "vendor_details": {
    "vendor": "Tropical Fruit Growers of New Zealand",
    "main_contact": "Hugh Rose",
    "vendor_location": "Whangarei, New Zealand",
    "preferred_vendor": true
  }
}

GET produce/_mapping
```

#### Dynamic Mapping

When a user does not define a mapping in advance, Elasticsearch creates or updates the mapping as needed by default. This is known as _dynamic mapping_. With dynamic mapping, Elasticsearch looks at each field and tries to infer the data type from the field content. Then, it assigns a type to each field and creates a list of field names and types known as the mapping. Depending on the assigned field type, each field is indexed and primed for different types of requests (full text search, aggregations, sorting). This is why mapping plays an important role in how Elasticsearch stores and searches for data.

#### Indexing Strings

There are two kinds of string field types:

- Text
- Keyword

By default, every string gets mapped twice: as a text field and as a keyword multi-field. Each field type is primed for different types of requests. The **text** field type is designed for full-text searches. The **keyword** field type is designed for exact searches, aggregations, and sorting. You can customize your mapping by assigning the field type as either text or keyword or both.

##### Text Field Type

The **text** field type facilitates text analysis. Ever notice that search in Elasticsearch is not case sensitive and punctuation does not seem to matter? This is because text analysis occurs when your fields are indexed. By default, strings are analyzed when they are indexed. The string is broken up into individual words, also known as tokens. The analyzer further lowercases each token and removes punctuation. Once the string is analyzed, the individual tokens are stored in a sorted list known as the _inverted index_. Each unique token is stored in the inverted index with its associated document id. The same process occurs every time you index a new document.
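To see this text analysis in action, you can run the analyze API with the standard analyzer - a minimal sketch (the sample sentence is arbitrary).

```bash
GET _analyze
{
  "analyzer": "standard",
  "text": "Shape of You, by Ed Sheeran!"
}
```

The response lists the lowercased tokens with punctuation stripped - exactly what would end up in the inverted index.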
##### Keyword Field Type

The **keyword** field type is used for aggregations, sorting, and exact searches. These actions look up the document id to find the values it has in its fields. A **keyword** field is suited to perform these actions because it uses a data structure called _doc values_ to store data. For each document, the document id along with the field value (the original string) is added to the table. This data structure (i.e. _doc values_) is designed for actions that require looking up the document id to find the values it has in its fields.

When Elasticsearch dynamically creates a mapping for you, it does not know what you want to use a string for, so it maps all strings to both field types. In cases where you do not need both field types, the default setting is wasteful. Since both field types require creating either an inverted index or doc values, creating both field types for unnecessary fields will slow down indexing and take up more disk space. This is why we define our own mapping: it helps us store and search data more efficiently.

#### Mapping Exercise "Produce Warehouse"

Project: Build an app for a client who manages a produce warehouse. This app must enable users to:

1. search for produce name, country of origin, and description
2. identify top countries of origin with the most frequent purchase history
3. sort produce by produce type (Fruit or Vegetable)
4. get the summary of monthly expenses

See the sample data above.

- Feature 1 requires the **text** field type for the produce **name**, **country_of_origin**, and **description** fields.
- Feature 2 requires, for the terms aggregation, the **keyword** field type for the **country_of_origin** field.
- Feature 3 requires, for sorting, the **keyword** field type for the **produce_type** field.
- Feature 4 requires, for the **date_histogram** aggregation, the **date** type for the **date_purchased** field.

#### Defining your own mapping

If you do not define a mapping ahead of time, Elasticsearch dynamically creates the mapping for you. If you do decide to define your own mapping, you can do so at index creation. ONE mapping is defined per index. Once the index has been created, we can only add new fields to a mapping. We CANNOT change the mapping of an existing field. If you must change the type of an existing field, you must create a new index with the desired mapping, then reindex all documents into the new index.
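Adding a new field to an existing mapping, by contrast, works in place via the update mapping API - a minimal sketch (the boolean field **organic** is a hypothetical addition to the _produce_ index).

```bash
PUT produce/_mapping
{
  "properties": {
    "organic": {
      "type": "boolean"
    }
  }
}
```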
Step 1: Index a sample document into a **produce** index (done, see above).

Step 2: View the dynamic mapping (done, see above).

Step 3: Create a new index with the optimized mapping as follows.

```bash
PUT produce_optimized
{
  "mappings": {
    "properties": {
      "botanical_name": { "enabled": false },
      "country_of_origin": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "date_purchased": { "type": "date" },
      "description": { "type": "text" },
      "name": { "type": "text" },
      "produce_type": { "type": "keyword" },
      "quantity": { "type": "long" },
      "unit_price": { "type": "float" },
      "vendor_details": { "enabled": false }
    }
  }
}
```

Step 4: Check the mapping of the new index.

```bash
GET produce_optimized/_mapping
```

Step 5: Index your data into the new index (e.g., the sample document from above)...

```bash
POST produce_optimized/_doc
{
  "name": "Pineapple",
  "botanical_name": "Ananas comosus",
  "produce_type": "Fruit",
  "country_of_origin": "New Zealand",
  "date_purchased": "2020-06-02T12:15:35",
  "quantity": 200,
  "unit_price": 3.11,
  "description": "a large juicy tropical fruit consisting of aromatic edible yellow flesh surrounded by a tough segmented skin and topped with a tuft of stiff leaves. These pineapples are sourced from New Zealand.",
  "vendor_details": {
    "vendor": "Tropical Fruit Growers of New Zealand",
    "main_contact": "Hugh Rose",
    "vendor_location": "Whangarei, New Zealand",
    "preferred_vendor": true
  }
}
```

... or reindex the data from the original index into the new one.

```bash
POST _reindex
{
  "source": {
    "index": "produce"
  },
  "dest": {
    "index": "produce_optimized"
  }
}
```

### Visualizing Weather Data with Kibana

Purchase historical weather data from [OpenWeather](https://openweathermap.org/) and check out the [structure](https://openweathermap.org/history-bulk) of the JSON objects. Preprocess the data as needed (e.g., timestamps are in seconds in OpenWeather vs. milliseconds in Elasticsearch).

In Elasticsearch, create an index with a suitable explicit mapping as follows.

```bash
PUT /weatherdata
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "dt": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss||epoch_second" },
      "timezone": { "type": "integer" },
      "main": {
        "properties": {
          "temp": { "type": "float" },
          "temp_min": { "type": "float" },
          "temp_max": { "type": "float" },
          "feels_like": { "type": "float" },
          "pressure": { "type": "float" },
          "humidity": { "type": "float" },
          "dew_point": { "type": "float" }
        }
      },
      "clouds": {
        "properties": {
          "all": { "type": "integer" }
        }
      },
      "weather": {
        "properties": {
          "id": { "type": "integer" },
          "main": { "type": "keyword" },
          "description": { "type": "text" },
          "icon": { "type": "keyword" }
        }
      },
      "wind": {
        "properties": {
          "speed": { "type": "float" },
          "deg": { "type": "integer" }
        }
      },
      "timestamp": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss||epoch_millis" },
      "name": { "type": "keyword" },
      "coord": { "type": "geo_point" }
    }
  }
}
```

With the Python script, bulk load the data into the index (the Kibana file upload cannot handle files larger than 100 MB). In Kibana, on the _Dashboard_ view, create an _Aggregation based_ visualization with the maximum temperature and accumulated precipitation per year. Draw a threshold line at 33°C.

### Searching MovieLens Data

Download the [Small MovieLens Latest data set](https://grouplens.org/datasets/movielens/) that is "recommended for education and development". Run **preprocess_movielens_latest_small.py** to create an **.ndjson** file for an Elasticsearch bulk import. Bulk import the movies.
```bash
curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@movies.ndjson"
```

Check the unsuitable dynamically created mapping (e.g., _year_ is of type _long_). Define a **movies** index with the following mapping and re-import the movies data. Note that in Elasticsearch, there is no dedicated [array](https://www.elastic.co/guide/en/elasticsearch/reference/current/array.html) data type.

```bash
PUT movies
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "custom_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "custom_edge_ngram"]
          }
        },
        "filter": {
          "custom_edge_ngram": {
            "type": "edge_ngram",
            "min_gram": 2,
            "max_gram": 10
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": { "type": "long" },
      "title": {
        "type": "text",
        "analyzer": "custom_analyzer",
        "search_analyzer": "standard"
      },
      "year": { "type": "date", "format": "yyyy" },
      "genres": { "type": "keyword" }
    }
  }
}
```

Experience autocompletion based on the edge n-grams as follows.

```bash
GET /movies/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "Star Wa"
      }
    }
  }
}
```

Demonstrate the autocompletion Web app in **apps/autocompletion**.

### Load Amazon Products via Logstash

Copy the Amazon products data (i.e. **data/amazon-products**) including the Logstash configuration for this data into the Logstash data folder (i.e. **logstash/data**).

```bash
docker cp .\data\amazon-products elasticsearch-logstash-1:/usr/share/logstash/data/amazon-products
```

Create an **amazon** index as follows.

```bash
DELETE /amazon

PUT /amazon
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {}
    }
  },
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "title": { "type": "text" },
      "description": { "type": "text" },
      "manufacturer": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      },
      "price": { "type": "scaled_float", "scaling_factor": 100 }
    }
  }
}
```

Log into the Logstash Docker container, remove the **.lock** file, and run the Logstash configuration.

```bash
docker exec -it elasticsearch-logstash-1 bash
cd /usr/share/logstash/data
ls -lah
rm -rf .lock
cd ..
bin/logstash -f data/amazon-products/logstash.conf
```

Once the import is complete, hit _Ctrl + C_ to break out of Logstash. Otherwise, it would keep waiting for data to be appended to the input file.

### Index the Works of William Shakespeare

Create an index with mappings properties.

```json
GET /_cat/indices

PUT /shakespeare
{
  "mappings": {
    "properties": {
      "speaker": { "type": "keyword" },
      "play_name": { "type": "keyword" },
      "line_id": { "type": "integer" },
      "speech_number": { "type": "integer" }
    }
  }
}

GET /shakespeare/_mapping
```

[Bulk index](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html) Shakespeare's life's work.

```bash
curl -H "Content-Type: application/json" -XPOST 127.0.0.1:9200/shakespeare/_bulk --data-binary @shakespeare.json
```

Query Shakespeare's life's work.

```json
GET /shakespeare/_count

GET /shakespeare/_search

GET /shakespeare/_search
{
  "query": {
    "match_phrase": {
      "text_entry": "to be or not to be"
    }
  }
}
```
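As a closing exercise, a **terms** aggregation on the **play_name** keyword field counts lines per play - a sketch along the lines of the aggregations above (the aggregation name _lines_per_play_ is arbitrary).

```json
GET /shakespeare/_search
{
  "size": 0,
  "aggs": {
    "lines_per_play": {
      "terms": {
        "field": "play_name",
        "size": 10
      }
    }
  }
}
```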