This Tutorial is for webmaster/programmers. By practicing simple tasks at the command line, you will learn the basics of how to:
To run the tutorial, you will need:
System requirements:
The following would be helpful before you start, but are not required:
The Tutorial shows the process for setting up an index and performing some example queries on a fictitious web site that includes a forum where members discuss video games.
Open your browser and go to http://www.searchify.com/plans and choose one of the plans.
Enter your email address and a desired password.
Result:
The Searchify dashboard appears, showing your new account:
Click new index.
In Index Name, type test_index.
Click Create Index.
Result:
The dashboard is displayed. In INDEX NAME, test_index
appears. In STATUS, you can see whether the index is ready to use:
Wait for a short time (typically less than one minute) to give Searchify time to set up the cloud resources for your index.
Click your browser's Refresh button.
If STATUS has changed to Running, you can proceed to the next part of the tutorial. If STATUS is still Initializing, wait a bit, then hit Refresh again.
Result:
You now have an empty index that is ready to be populated with content.
Click client documentation or go to http://www.searchify.com/documentation.
Click Python client library. This takes you to
http://www.searchify.com/documentation/python-client.
In the Download area, in Stable Version, click one of the links: zip for Windows or Mac, tgz for Unix/Linux.
Extract the compressed library to a convenient directory on your computer.
Open a console window and change (
cd
) to the directory where the client library is installed.
Result:
If you run the command to view the contents of the directory (
ls
on Unix or Mac OS X;
dir
on Windows), you should see the file
indextank_client.py
.
While still in the directory where the client library is installed, run the Python interpreter.
If you don't know how to use the Python interpreter, see
http://docs.python.org/tutorial/interpreter.html.
C:\> python
The Python prompt appears:
>>>
Import the IndexTank client library to the Python interpreter by typing the following command at the Python prompt.
>>> import indextank.client as itc
itc
is a name you give to the imported client so you can refer to it later.
Instantiate the client by calling
ApiClient()
.
>>> api_client = itc.ApiClient('YOUR_API_URL')
For
YOUR_API_URL
, substitute the URL from Private URL
in your Dashboard (refer to the screen shot at the beginning of the Tutorial if you forgot where to find this).
Get a handle to your test index.
>>> test_index = api_client.get_index('test_index')
Here we call the
get_index()
method in the client library and pass it the name you assigned when you created the index.
Add some documents to the index.
>>> test_index.add_document('post1', {'text':'I love Bioshock'}) >>> test_index.add_document('post2', {'text':'Need cheats for Bioshock'}) >>> test_index.add_document('post3', {'text':'I love Tetris'})
Here we call the
add_document()
method in the client library three times to index three posts in the video gamer forum.
post1
,
post2
, and
post3
are unique alphanumeric IDs you give to the documents. If the doc ID is not unique, you will overwrite the existing document with the same ID, so watch out for typos during this Tutorial!
The method parameters are name:value pairs that build up the index for one document. In this example, there is a single name:value pair for each forum post.
text
is a field name, and it has a special meaning to Searchify: in search queries, Searchify defaults to searching
text
if the query does not specify a different field. You'll learn more about this later, in Use Fields to Divide Document Text.
Result:
test_index now contains:
Doc ID | Field | Value |
---|---|---|
post1 | text | I love Bioshock |
post2 | text | Need cheats for Bioshock |
post3 | text | I love Tetris |
text
,
so the search query doesn't have to specify a field.
Suppose you're interested in a particular game, and you want to find all the posts that contain Bioshock:
>>> test_index.search('Bioshock')
The output should look like this (you can ignore
facets
for now). The search term was found in post1 and post2:
{'matches': 2, 'facets': {}, 'search_time': '0.070', 'results': [{'docid': 'post2'},{'docid': 'post1'}]}
Suppose you want to find only the true enthusiasts on the forum. You can search for posts that contain Bioshock and love.
>>> test_index.search('love Bioshock')
The output should look like this. The two search terms were found together only in post1:
{'matches': 1, 'facets': {}, 'search_time': '0.005', 'results': [{'docid': 'post1'}]}
You can also use the query operators OR and NOT. Let's try OR, which would come in handy if you play more than one game and you want to find posts that mention any of your favorites.
>>> test_index.search('Bioshock OR Tetris')
The output should look like this:
{'matches': 3, 'facets': {}, 'search_time': '0.007', 'results': [{'docid': 'post3'},{'docid': 'post2'},{'docid': 'post1'}]}
To ask Searchify to return more than just the doc ID, add the argument
fetch_fields
.
>>> test_index.search('love', fetch_fields=['text'])['results']
Here,
text
means we would like the full text of the document where the search term was found. This would be useful, for example, to construct an output page that provides complete result text for the reader to look at. The output should look like this:
[{'text': 'I love Tetris', 'docid': 'post3'}, {'text': 'I love Bioshock', 'docid': 'post1'}]
To show portions of the result text with the search term highlighted, use
snippet_fields
.
>>> test_index.search('love', snippet_fields=['text'])['results']
The output should look like this:
[{'snippet_text': 'I <b>love</b> Tetris', 'docid': 'post3'}, {'snippet_text': 'I <b>love</b> Bioshock', 'docid': 'post1'}]
So far, we have worked with simple document index entries that contain only a single field,
text
,
containing the complete text of the document. Let's redefine the documents now and add some more fields to enable more targeted searching.
Set up two fields for each document: the original
text
field plus a new field,
game
,
that contains the name of the video game that is the subject of the forum post.
>>> test_index.add_document('post1', {'text':'I love Bioshock', 'game':'Bioshock'}) >>> test_index.add_document('post2', {'text':'Need cheats for Bioshock', 'game':'Bioshock'}) >>> test_index.add_document('post3', {'text':'I love Tetris', 'game':'Tetris'})
Here we call
add_document()
with the same document IDs as before, so Searchify will overwrite the existing entries in your test index.
Result:
test_index now contains:
Doc ID | Field | Value |
---|---|---|
post1 | text | I love Bioshock |
game | Bioshock | |
post2 | text | Need cheats for Bioshock |
game | Bioshock | |
post3 | text | I love Tetris |
game | Tetris |
Now you can search within a particular field. Let's use
fetch_fields
again to get some user-friendly output.
>>> test_index.search('game:Tetris', fetch_fields=['text'])['results']
Note that we can return the contents of a field even if it is not being searched. The output should look like this:
[{'text': 'I love Tetris', 'docid': 'post3'}]
A scoring function
is a mathematical formula that you can reference in a query to influence the ranking of search results. Scoring functions are named with integers starting at 0 and going up to 5. Function 0 is the default and will be applied if no other is specified; it starts out with an initial definition of
-age
,
which sorts query results from most recently indexed to least recently indexed (newest to oldest).
Function 0 uses the
timestamp
field which Searchify provides for each document. The time is recorded as the number of seconds since epoch. Searchify automatically sets each document's timestamp to the current time when the document is indexed, but you can override this timestamp. To make this scoring function tutorial easier to follow, that's what we are going to do.
Assign timestamps to some new posts in the index.
>>> test_index.add_document('newest',{'text': 'New release: Fable III is out','timestamp':1286673129}) >>> test_index.add_document('not_so_new',{'text': 'New release: GTA III just arrived!','timestamp':1003626729}) >>> test_index.add_document('oldest',{'text': 'New release: This new game Tetris is awesome!','timestamp':455332329})
Search using the default scoring function.
>>> test_index.search('New release')
The output should look like this. The default scoring function has sorted the documents from newest to oldest:
{'matches': 3, 'facets': {}, 'search_time': '0.002', 'results': [{'docid': 'newest'},{'docid': 'not_so_new'},{'docid': 'oldest'}]}
Redefine function 0 to sort in the opposite order, by removing the negative sign from the calculation.
>>> test_index.add_function(0,'age')
Search again.
>>> test_index.search('New release')
The output should look like this. The oldest document is now first:
{'matches': 3, 'facets': {}, 'search_time': '0.005', 'results': [{'docid': 'oldest'},{'docid': 'not_so_new'},{'docid': 'newest'}]}
Let's try creating another scoring function, function 1, using a different Searchify built-in score called
relevance
.
Relevance is calculated using a proprietary algorithm, and indicates which documents best match a query. First, add some test documents that will more clearly illustrate the effect of the relevance score.
>>> test_index.add_document('post4', {'text': 'When is Duke Nukem Forever coming out? I need my Duke.'}) >>> test_index.add_document('post5', {'text': 'Duke Nukem is my favorite game. Duke Nukem rules. Duke Nukem is awesome. Here are my favorite Duke Nukem links.'}) >>> test_index.add_document('post6', {'text': 'People who love Duke Nukem also love our great product!'})
Now define function 1.
>>> test_index.add_function(1,'relevance')
Search using the new scoring function.
>>> test_index.search('duke', scoring_function=1, fetch_fields=['text'])['results']
The output should look like this. The most relevant document is now first:
[{'docid': 'post5', 'text': 'Duke Nukem is my favorite game. Duke Nukem rules. Duke Nukem is awesome. Here are my favorite Duke Nukem links.'}, {'docid': 'post4', 'text': 'When is Duke Nukem Forever coming out? I need my Duke.'}, {'docid': 'post6', 'text': 'People who love Duke Nukem also love our great product!'}]
In addition to textual information, each document can have up to three (3) document variables to store any numeric data you would like. Each variable is referred to by number, starting with variable 0. Document variables provide additional useful information to create more subtle and effective scoring functions.
For example, assume that in the video game forum, members can vote for posts that they like. The forum application keeps track of the number of votes. These vote totals can be used to push the more popular posts up higher in search results.
Let's also assume that the forum software assigns a spam score by examining each new post for evidence that it is from a legitimate forum member and contains relevant content, and then assigning a confidence value from 0 (almost certainly spam) to 1 (high confidence that the post is legitimate).
Assign the total votes to document variable 0 and the spam score to document variable 1.
>>> test_index.add_document('post4', {'text': 'When is Duke Nukem Forever coming out? I need my Duke.'}, variables={0:10, 1:1.0}) >>> test_index.add_document('post5', {'text': 'Duke Nukem is my favorite game. Duke Nukem rules. Duke Nukem is awesome. Here are my favorite Duke Nukem links.'}, variables={0:1000, 1:0.9}) >>> test_index.add_document('post6', {'text': 'People who love Duke Nukem also love our great product!'}, variables={0:1, 1:0.05})
Use the document variables in a scoring function.
>>> test_index.add_function(2, 'relevance * log(doc.var[0]) * doc.var[1]')
Run a query using the scoring function.
>>> test_index.search('duke', scoring_function=2, fetch_fields=['text'])['results']
The output should look like this:
[{'docid': 'post5', 'text': 'Duke Nukem is my favorite game. Duke Nukem rules. Duke Nukem is awesome. Here are my favorite Duke Nukem links.'}, {'docid': 'post4', 'text': 'When is Duke Nukem Forever coming out? I need my Duke.'}, {'docid': 'post6', 'text': 'People who love Duke Nukem also love our great product!'}]
When more readers vote for a post, update the vote total in variable 0.
>>> test_index.update_variables('post4',{0:1000000})
Now run the query again with the same scoring function.
>>> test_index.search('duke', scoring_function=2, fetch_fields=['text'])['results']
The output should show the new most-popular post first:
[{'docid': 'post4', 'text': 'When is Duke Nukem Forever coming out? I need my Duke.'}, {'docid': 'post5', 'text': 'Duke Nukem is my favorite game. Duke Nukem rules. Duke Nukem is awesome. Here are my favorite Duke Nukem links.'}, {'docid': 'post6', 'text': 'People who love Duke Nukem also love our great prod]
If you're 100% confident something should not be in the index, it makes sense to remove it.
Take out that spam document.
>>> test_index.delete_document('post6')
Search again to confirm the deletion.
>>> test_index.search('duke', scoring_function=2, fetch_fields=['text'])['results']
The output should show only two results:
[{'docid': 'post4', 'text': 'When is Duke Nukem Forever coming out? I need my Duke.'}, {'docid': 'post5', 'text': 'Duke Nukem is my favorite game. Duke Nukem rules. Duke Nukem is awesome. Here are my favorite Duke Nukem links.'}]
You can pass variables with a query and use them as input to a scoring function. This is useful, for example, to customize results for a particular user. Suppose we're dealing with the search on the forum site. It makes sense to index the poster's gamerscore to use it as part of the matching process.
Add a document variable. This will be compared to the query variable later. Here variable 0 holds the gamerscore of the person who made the post:
>>> test_index.add_document('post1', {'text':'I love Bioshock'}, variables={0:115}) >>> test_index.add_document('post2', {'text':'Need cheats for Bioshock'}, variables={0:2600}) >>> test_index.add_document('post3', {'text':'I love Tetris'}, variables={0:19500})
Suppose we want to boost posts from forum members with a gamerscore closer to that of the searcher. Let's define a scoring function to do that.
>>> test_index.add_function(1, 'relevance / max(1, abs(query.var[0] - doc.var[0]))')
This scoring function prioritizes gamerscores close to the searcher's own (query.var[0]
).
The absolute (abs
)
function is there to provide symmetry for gamerscores above and below the searcher's. The
max
function ensures that we never divide by 0, and evens out all gamerscore differences less than 1 (in case we use floats for the gamerscores, instead of integers).
Suppose John is the searcher, with a gamerscore of 25. In the query, set variable 0 to the gamerscore.
>>> test_index.search('bioshock', scoring_function=1, fetch_fields=['text'], variables={0: 25})['results']
The output should look like this:
[{'docid': 'post1', 'text': 'I love Bioshock.'}, {'docid': 'post2', 'text': 'Need cheats for Bioshock.'}]
For Isabelle, gamerscore 15,000:
>>> test_index.search('love', scoring_function=1, fetch_fields=['text'], variables={0: 15000})['results']
The output should look like this:
[{'docid': 'post3', 'text': 'I love Tetris.'}, {'docid': 'post1', 'text': 'I love Bioshock.'}]
Now that you have learned some of the basic functionality of Searchify, you are ready to go more in-depth:
Enjoy using Searchify to improve the quality of search on your website.