Information Gathering
***This blog accompanies one of my YouTube videos, where I explain OWASP and this OWASP list. I recommend watching it before you read this blog***
Here, we try to understand the deployed configuration of the server hosting the web application.
OTG-INFO-001 : Conduct Search Engine Discovery/ Reconnaissance for Information Leakage
There are direct and indirect methods of search engine discovery.
Direct methods relate to searching the indexes and the associated content from caches, while indirect methods relate to collecting sensitive design and configuration information by searching forums, newsgroups and tendering websites.
Search engines use crawlers. Once a crawler has finished crawling a page, the search engine indexes it based on its tags and associated attributes, such as <title>, <head> and so on. The search engine then goes through these indexes in order to give us the relevant information.
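As a toy sketch of this idea (every page, URL and class name here is invented for illustration), the snippet below extracts the <title> of a page and builds a tiny keyword index, the way a crawler-fed index maps terms back to URLs:

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collect the text inside <title> -- one of the fields a crawler indexes."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# A made-up "crawled" page; a real crawler would fetch many of these
pages = {
    "https://example.com/": "<html><head><title>Example Domain</title></head></html>",
}

# Toy index: keyword -> list of URLs whose title contains it
index = {}
for url, html in pages.items():
    parser = TitleExtractor()
    parser.feed(html)
    for word in parser.title.lower().split():
        index.setdefault(word, []).append(url)

print(index["example"])  # ['https://example.com/']
```

A real search engine does vastly more (ranking, link analysis, full-text indexing), but the core mapping from page content to an index is the same shape.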
Before Going forward....
1. What are web crawlers?
A web crawler is a bot that downloads and indexes content from all over the internet. These bots are mostly operated by search engines. By applying search algorithms to the data collected by the web crawlers, search engines can provide relevant links in response to user queries.
2. Why Crawlers?
To learn what every webpage on the web is about, so that the information requested by a user can be retrieved when it is required.
3. What is robots.txt file?
Robots.txt is a text file that instructs web robots (typically search engine crawlers) how to crawl pages on a website. Basically, the robots.txt file indicates whether certain user agents can or cannot crawl certain parts of a website. Crawl instructions are specified by "allowing" and "disallowing" the behavior of particular user agents.
Syntax >>> User-agent: [user-agent name] Disallow: [URL string not to be crawled]
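To see how these Allow/Disallow rules are interpreted in practice, Python's standard-library robotparser can evaluate a sample robots.txt (the file content and domain below are invented for this sketch):

```python
from urllib import robotparser

# An invented robots.txt for illustration
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/

User-agent: BadBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# The wildcard agent may not crawl /admin/, but the home page is fine
print(rp.can_fetch("*", "https://example.com/admin/login"))      # False
print(rp.can_fetch("*", "https://example.com/index.html"))       # True
# BadBot is disallowed everywhere
print(rp.can_fetch("BadBot", "https://example.com/index.html"))  # False
```

Note that robots.txt is purely advisory: a crawler can ignore it, and testers often read it precisely because its Disallow lines point at paths the owner considers sensitive.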
4. How to see the robots.txt file?
Syntax >>> https://<domain_name>/robots.txt
Following is an example: https://www.google.com/robots.txt
5. More about robots.txt file....
The robots.txt file is part of the Robots Exclusion Protocol (REP). REP is a group of web standards that regulate how robots crawl the web, access and index content, and serve that content to users.
More reading about robots.txt >>> Click Here
If the robots.txt file is not updated during the lifetime of the website, and inline HTML meta tags that instruct robots not to index content have not been used, then it is possible for indexes to contain web content the owners did not intend to include.
Website owners may use the previously mentioned robots.txt, HTML meta tags, authentication and tools provided by search engines to remove such content.
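For example, the robots meta tag mentioned above looks like this (placed in the `<head>` of any page that should not be indexed; a generic illustration, not tied to any particular site):

```
<meta name="robots" content="noindex, nofollow">
```

Unlike robots.txt, which is a single site-wide file, this tag works per page, and "noindex" asks engines not to list the page even if they crawl it.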
Test Objectives
Here we aim to understand what sensitive design and configuration information of the application/system/organization is exposed both directly (on the organization's website) and indirectly (on third-party websites).
What to test
We have to use search engines to search for,
* Network diagrams and configurations
* Archived posts and emails by administrators and other key staff
* Log-on procedures and username formats
* Usernames and passwords
* Error message content
* Development, test, UAT (User Acceptance Test) and staging versions of the website
Search Engines
Do not limit testing to just one search engine provider as they may generate different results depending on when they crawled content and their own algorithms.
Consider the following search engines:
• Baidu
• binsearch.info
• Bing
• Duck Duck Go
• ixquick/Startpage
• Google
• Shodan
• PunkSpider
Duck Duck Go and ixquick/Startpage provide reduced information leakage about the tester.
PunkSpider is a web application vulnerability search engine. It is of little use to a penetration tester doing manual work. However, it can be useful as a demonstration of how easily script kiddies can find vulnerabilities.
Google provides the advanced "cache:" search operator [2], but this is equivalent to clicking "Cached" next to each Google search result. Hence, using the advanced "site:" search operator and then clicking "Cached" is preferred.
Example : site:owasp.org
This will retrieve all pages indexed under the owasp.org domain.
To display the index.html of owasp.org as cached, the syntax is: cache:owasp.org
Google Hacking Database
This is a list of useful search queries for Google. These queries can be put into different categories:
• Footholds
• Files containing usernames
• Sensitive Directories
• Web Server Detection
• Vulnerable Files
• Vulnerable Servers
• Error Messages
• Files containing juicy info
• Files containing passwords
• Sensitive Online Shopping Info
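A few illustrative queries in the GHDB style (classic examples of these categories; what they actually return depends on what the engine has indexed):

```
intitle:"index of" "parent directory"    # open directory listings (Footholds)
filetype:log inurl:password              # log files that may contain credentials
site:example.com filetype:sql            # database dumps exposed on a given domain
```

Combining "site:" with operators like "filetype:" or "inurl:" narrows the results to a single target, which is how these categories are used during a test.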
What are the options/ Remediation
Carefully consider the sensitivity of design and configuration information before it is posted online.
Periodically review the sensitivity of existing design and configuration information that is posted online.
Read about advanced google search >>> Click Here
Trust me, Google advanced search will help you get the best out of Google. It will be an added advantage if you use Google daily.
***I will post OTG-INFO-002 next week. Keep in touch***
Thank You!!!