Our bots - automated programs
A bot (short for "robot") is a program that operates as an agent for a user or another program, or simulates a human activity.
On the Internet, the most ubiquitous bots are the programs, also called spiders or crawlers, that access websites and gather their content for search engine indexes.
BOT-
A bot is an automated software application which typically
performs tasks over the Internet. There is virtually an unlimited number of
bots performing a dizzying array of tasks. In the world of online marketing, we
see bots used to crawl websites, scrape content, check search rankings, and
automate social media and much more.
In some cases bots are good things, offering automation that enables huge advances in access to information (think search engine bots). In other cases bots can be “evil”, stealing content, breaking websites, and costing business owners a ton of money.
It is always best to have permission before using any bot on a website other than your own.
Twitterbot
A Twitterbot is a program used to produce automated posts on the Twitter microblogging service, or to automatically follow Twitter users. Twitter bots come in various forms. For example, many serve as spam, enticing clicks on promotional links. Others post @replies or automatically "retweet" in response to tweets that include a certain word or phrase. These automatic tweets are often seen as fun or silly. Some Twitter users even program Twitter bots to assist themselves with scheduling or reminders.
Features of a Twitterbot
It is sometimes desirable to identify when a Twitter account is controlled by a bot. In a 2012 paper, Chu et al. propose the following criteria that indicate that an account may be a bot (they were designing an automated system):
· "Periodic and regular timing" of tweets;
· Whether the tweet content contains known spam; and
· The ratio of tweets from mobile versus desktop, as compared to an average human Twitter user.
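The first criterion, periodic and regular timing, can be illustrated with a small sketch. This is not the method from the Chu et al. paper. It is a simplified stand-in that measures how evenly spaced a list of post timestamps is, using the coefficient of variation of the gaps between posts. The function names and the 0.1 threshold are assumptions chosen for illustration only.

```python
from statistics import mean, pstdev


def timing_regularity(timestamps):
    """Return the coefficient of variation of the gaps between posts.

    Timestamps are seconds (any epoch). Values near 0 mean the posts
    are almost evenly spaced, which is unusual for a human.
    """
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three timestamps")
    average_gap = mean(gaps)
    if average_gap == 0:
        return 0.0
    return pstdev(gaps) / average_gap


def looks_periodic(timestamps, threshold=0.1):
    """Flag an account whose posting gaps barely vary.

    The threshold is an arbitrary, illustrative value (an assumption).
    """
    return timing_regularity(timestamps) < threshold
```

An account posting exactly once an hour would be flagged, while irregular human-style posting would not. A real detector would combine this with the other criteria rather than rely on timing alone.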
There are many different types of Twitter bots, and their purposes vary from one to another. Some bots may tweet helpful material such as @EarthquakesSF (description below).
· @chatmundo is an AI conversational Twitter bot based on Program O which responds to @chatmundo mentions.
· @WBEZbot tweets the current programming on NPR’s Chicago affiliate station, WBEZ.
· @KookyScrit sends auto-reply tweets correcting misspellings of the word "weird."
· @choose_this sends at-replies to Twitter users who tweet about making a choice between a wide variety of things.
Twitter Bots
Automatically follow, unfollow, retweet, gather IDs, and more.
The Good: Build a following fast and automatically generate content for
your tweets (or retweets).
The Bad: If your followers realize they’ve followed a bot, you’re not likely to make many friends. Excessive bot usage can get you banned from Twitter.
Examples: tweet adder, Injek Twit.
Website Scrapers
Identify and download specific strings of text or images from a website.
The Good: Save marketing admins countless hours of copy and paste work by
scraping website content yourself.
The Bad: Scraping is normally associated with taking other people’s
content to use as your own or scraping a website without permission or in
violation of the website’s TOS or AUP.
Examples: Website Content Extractor, Fetch.com
Website Crawlers / Scrapers – Search Engines (indexing software)
Bots sent by search engines to browse and store content from your website. This content is then used to help rank your website on said search engine.
The Good: Search engines exist, and I’m guessing you get a good percentage
of your traffic from search engines.
The Bad: You have to learn how this little bot works to make sure your
site is accessible and easily navigated by the various bots. Think linking
structure and indexability.
Examples: 80legs
Search Rank Checkers
Used to check the position of your organic and paid listings on search engines.
The Good: Learn how you’re ranking across a variety of search engines,
countries, languages and more. When used appropriately with an API you will not
have to worry about TOS/AUP violations.
The Bad: When not used appropriately, or without an API, search rank checkers may get your IP banned or, worse, your website. Be nice to search engines and try to play by the rules.
Examples: Rank Tracker, Rank Reporter, Web Position Reporter
Facebook Bots
Mass friend requests, messaging, wall posting, poking, status updates and more.
The Good: Build your friend / fan lists quickly and automatically update your status, images and more.
The Bad: Everything else. Automated mass friend requesting is a good way to get your account banned, and automated pokes just sound painful.
Example: Facebook Blaster Pro
Comment Spam Bots
Used to post comment spam on blogs, forums and news websites for the purpose of link building.
The Good: None.
The Bad: Your blog is taken over by comments like “I really like your article. You should check out my website about Cialis www.cialisisawesome.com.” Also, if you get caught using a comment spam bot, Matt Cutts is likely to ban you from Google and flame you on his blog. Comment spammers then spam that blog post in retort.
PPC Bots
Click on your competitor’s ads, influence bounce rates, and generally cause havoc with AdWords accounts.
The Good: None
The Bad: Not only are people who use PPC bots defrauding advertisers, they are also at great risk of being banned by search engines.
Link Building Bots
Find websites and automatically email webmasters requesting back links.
The Good: Build links automatically.
The Bad: You’ll end up annoying countless webmasters who may end up posting negative information about your website. Couple this with the fact that you’ll be a low-rate spammer a hair’s breadth away from being banned by most search engines, and link building bots are better left alone.
Bots offer a powerful way for you to automate repetitive tasks; however, think long and hard about using bots with third-party websites or services. As mentioned before, check the terms of service, Acceptable Use Policy and any applicable laws. It is always best to have permission before using a bot on any website other than your own.
What You Should Know About Google Bots and SEO
Ad clicks aren’t the only pressing issue for websites when it comes to online marketing and bots. Since tools like Google Analytics don’t provide granular insight into site traffic, it can be very tricky to differentiate between human and non-human traffic unless you really drill down, or notice after the fact that something is amiss. Even more difficult is distinguishing between malicious bots that can harm your website and good bots, like Googlebot, that improve SEO.
If you run an e-commerce website or a personal blog, chances are you want Google to visit your site and index your content as often as possible. By doing so, Google learns what is new on your site and can immediately share updated content with the online community. However, differentiating between Google and hackers who impersonate Google can be a major challenge for website operators and can have a damaging impact on your online business.
Googlebot has a distinctive way of identifying itself. It uses a specific user agent, arrives from IP addresses belonging to Google, and always adheres to robots.txt (the crawling instructions that website owners provide to such bots).
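Because a user agent string is trivial to forge, the IP address is the more reliable signal. One common way to check it is a reverse DNS lookup followed by a confirming forward lookup. The sketch below assumes that genuine Googlebot hosts resolve to names ending in googlebot.com or google.com; treat that suffix list as an assumption to confirm against Google's current guidance. The function name and the injectable lookup parameters are hypothetical and exist only so the logic can be exercised without network access.

```python
import socket

# Assumed hostname suffixes for genuine Google crawlers (verify before relying on them).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a Google host that resolves back to ip."""
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliases, addresses)
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, IPv4 addresses)
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup guards against spoofed reverse DNS records.
        return ip in forward_lookup(host)
    except OSError:
        # DNS failures (socket.herror, socket.gaierror) count as unverified.
        return False
```

With the default lookups this handles IPv4 addresses only. In practice the result would be cached, since two DNS queries per request are expensive.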
Google uses a robot called “Googlebot” that crawls millions of sites simultaneously and indexes their content in Google’s databases. The more Googlebot visits your site, the faster your site’s content updates will appear in Google’s search results. It’s crucial to allow Googlebot to crawl your website without blocking or disturbing it, and many companies invest in special SEO tools to attract it.
What is Robots.txt?
The robots exclusion protocol (REP), or robots.txt, is a text file webmasters create to instruct robots (typically search engine robots) how to crawl and index pages on their website.
Cheat Sheet
Block all web crawlers from all content
User-agent: *
Disallow: /
Block a specific web crawler from a specific folder
User-agent: Googlebot
Disallow: /no-google/
Block a specific web crawler from a specific web page
User-agent: Googlebot
Disallow: /no-google/blocked-page.html
Sitemap Parameter
User-agent: *
Disallow:
Sitemap: http://www.example.com/none-standard-location/sitemap.xml
Optimal Format
Robots.txt needs to be placed in the top-level directory of a
web server in order to be useful. Example: http://www.example.com/robots.txt
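To see how a well-behaved crawler would read rules like the ones in the cheat sheet, here is a minimal sketch using Python's standard urllib.robotparser module. The rules are fed in as text rather than fetched, and the bot name "OtherBot" and the page URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules modelled on the cheat sheet: block Googlebot from one folder,
# leave everything open for all other crawlers.
RULES = """\
User-agent: Googlebot
Disallow: /no-google/

User-agent: *
Disallow:
"""


def build_parser(text):
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(RULES)
# Googlebot is blocked from the folder but not from the rest of the site.
print(parser.can_fetch("Googlebot", "http://www.example.com/no-google/page.html"))
print(parser.can_fetch("Googlebot", "http://www.example.com/index.html"))
# A crawler with no specific group falls back to the "*" group.
print(parser.can_fetch("OtherBot", "http://www.example.com/no-google/page.html"))
```

A crawler would normally call set_url() and read() on the parser to download the live file from the top-level directory instead of parsing a string.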
What is Robots.txt?
The Robots Exclusion Protocol (REP) is a group of web standards
that regulate web robot behavior and search engine indexing. The REP consists
of the following:
- The original REP from 1994, extended 1997, defining crawler directives for robots.txt. Some search engines support extensions like URI patterns (wild cards).
- Its extension from 1996 defining indexer directives (REP tags) for use in the robots meta element, also known as "robots meta tag." Meanwhile, search engines support additional REP tags with an X-Robots-Tag. Webmasters can apply REP tags in the HTTP header of non-HTML resources like PDF documents or images.
- The microformat rel-nofollow from 2005 defining how search engines should handle links where the A element's REL attribute contains the value "nofollow."
Robots Exclusion Protocol Tags
Applied to a URI, REP tags (noindex, nofollow, unavailable_after) steer particular tasks of indexers, and in some cases (nosnippet, noarchive, noodp) even query engines at runtime of a search query. Unlike with crawler directives, each search engine interprets REP tags differently. For example, Google wipes out even URL-only listings and ODP references on its SERPs when a resource is tagged with "noindex," but Bing sometimes lists such external references to forbidden URLs on its SERPs. Since REP tags can be supplied in META elements of X/HTML contents as well as in HTTP headers of any web object, the consensus is that contents of X-Robots-Tags should overrule conflicting directives found in META elements.
Microformats
Indexer directives put as microformats will overrule page settings for particular HTML elements. For example, when a page's X-Robots-Tag states "follow" (there's no "nofollow" value), the rel-nofollow directive of a particular A element (link) wins.
Although robots.txt lacks indexer directives, it is possible to set indexer directives for groups of URIs with server-side scripts acting on site level that apply X-Robots-Tags to requested resources. This method requires programming skills and a good understanding of web servers and the HTTP protocol.
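As one possible illustration of such a server-side script, the sketch below is a small WSGI middleware that adds an X-Robots-Tag header to every response whose path starts with one of a given set of prefixes. The names add_robots_header and demo_app, the /private/ prefix, and the default header value are all assumptions for the example, not part of any particular server setup.

```python
def add_robots_header(app, prefixes, value="noindex, nofollow"):
    """Wrap a WSGI app so matching paths get an X-Robots-Tag header."""
    prefixes = tuple(prefixes)

    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def tagged_start_response(status, headers, exc_info=None):
            # Only responses under the chosen prefixes are tagged.
            if path.startswith(prefixes):
                headers = list(headers) + [("X-Robots-Tag", value)]
            return start_response(status, headers, exc_info)

        return app(environ, tagged_start_response)

    return middleware


def demo_app(environ, start_response):
    """Tiny stand-in application that serves a fake PDF body."""
    start_response("200 OK", [("Content-Type", "application/pdf")])
    return [b"%PDF-"]


# Every response under /private/ now carries the indexer directive.
application = add_robots_header(demo_app, ["/private/"])
```

The same effect is often achieved in web server configuration instead; the middleware form is shown only because it makes the per-path logic explicit.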
Pattern Matching
Google and Bing both honor two special characters (often loosely called regular expressions) that can be used to identify pages or sub-folders that an SEO wants excluded. These two characters are the asterisk (*) and the dollar sign ($).
- * is a wildcard that represents any sequence of characters
- $ matches the end of the URL
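The two characters described above can be translated into an ordinary regular expression. The following is a rough sketch of that translation, not any search engine's actual matcher: rules are treated as prefixes unless they end in $, and both function names are hypothetical.

```python
import re


def robots_pattern_to_regex(pattern):
    """Convert a robots.txt path pattern using * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and turn each * into "any sequence of characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without a trailing $, a rule matches any path that starts with it.
    return re.compile("^" + body + ("$" if anchored else ""))


def path_matches(pattern, path):
    """Return True if the URL path is covered by the pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None
```

For example, the pattern /*.pdf$ covers /files/report.pdf but not /files/report.pdf?x=1, because $ pins the match to the end of the URL.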
Public Information
Be aware that a robots.txt file is publicly available. Anyone can see what sections of a server the webmaster has blocked the engines from. This means that if an SEO has private user information that they don’t want publicly searchable, they should use a more secure approach, such as password protection, to keep visitors from viewing any confidential pages they don't want indexed.
Important Rules
- In most cases, meta robots with parameters "noindex, follow" should be employed as a way to restrict crawling or indexation.
- It is important to note that malicious crawlers are likely to completely ignore robots.txt and as such, this protocol does not make a good security mechanism.
- Only one "Disallow:" line is allowed for each URL.
- Each subdomain on a root domain uses separate robots.txt files.
- Google and Bing accept two specific regular expression characters for pattern exclusion (* and $).
- The filename of robots.txt is case sensitive. Use "robots.txt", not "Robots.TXT."
- Spacing is not an accepted way to separate query parameters. For example, "/category/ /product page" would not be honored by robots.txt.
SEO Best Practice
Blocking Page
There are a few ways to block search engines from accessing a given domain:
Block with Robots.txt
This tells the engines not to crawl the given URL, but that they may keep the page in the index and display it in results. (See image of Google results page below.)
Block with Meta NoIndex
This tells engines they can visit, but are not allowed to
display the URL in results. This is the recommended method.
Block by Nofollowing Links
This is almost always a poor tactic. Using this method, it is
still possible for the search engines to discover pages in other ways: through
browser toolbars, links from other pages, analytics, and more.
Why is Meta Robots better than Robots.txt?
Below is an example of about.com's robots.txt file. Notice that they are blocking the directory /library/nosearch/.
Now notice what happens when the URL is searched for in Google. Google has 2,760 pages from that "disallowed" directory. The engine hasn't crawled these URLs, so each appears as a bare URL rather than a traditional listing.
This becomes a problem when these pages accumulate links. Those pages can then accumulate link juice (ranking power) and other query-independent ranking metrics (like popularity and trust), but these pages can't pass these benefits to any other pages since the links on them don't ever get crawled.
In order to exclude individual pages from search engine indices, the noindex meta tag <meta name="robots" content="noindex"> is actually superior to robots.txt.
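When auditing a site, it can help to confirm that a page really carries the noindex directive. Here is a minimal sketch using Python's standard html.parser module. The class and function names are hypothetical, and the check only looks at the robots meta element, not at X-Robots-Tag headers or crawler-specific meta names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> elements."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when written without "=value".
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for token in attributes.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def is_noindex(html):
    """Return True if the page asks indexers not to list it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

Keep in mind that a crawler can only see this tag if robots.txt allows it to fetch the page in the first place.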