Web as Corpus
This site is moving to a new provider. You will get faster, more dependable service at this temporary URL:
http://199.19.225.121/. When testing is complete,
http://webascorpus.org will point to the new site.
- Web Concordancer (details below; feedback welcome; alternate site if response is slow)
- Search the Web directly for concordances of words and phrases in 34 different languages.
- This new release (last update: 24 May 2010) adds support for selecting
which documents to include in the zipfile, preselecting documents based on
document metrics, combining all text files into a single document for import
into kfNgram or a concordancer, and converting from UTF-8 into more widely
supported encodings. If it does not work properly for your language, please let me know.
- Web Corpus
- English-language corpora compiled from the Web in 2006 and 2007.
- 2007: still under development; currently 3,123,996 types and
518,129,710 tokens; target size at least 1,000,000,000 tokens; will be
part-of-speech tagged.
- 2006: 97,198,272 tokens and 950,087 types; 1- to 6-grams; wildcard
searchable; the original texts and URLs are no longer available
due to a hard drive failure.
- Search these Web Corpora
- Count Matching Webpages
- Count how many hits Bing and Yahoo! report for a word or phrase, expressed
both as an absolute number and as number of matches per million webpages.
Multiple search terms can be entered and queried at the same time, and
numbers can be either formatted for easier reading or left unformatted
for copying and pasting into a spreadsheet or database.
As you can see by comparing results
from these two search engines, such counts must be interpreted with extreme
caution! Bing numbers per million pages are generally smaller
than those from Yahoo!, probably due to an over-optimistic estimate of the
total number of pages in Bing's database.
- Latest Changes
- Wiki detailing additions and tweaks to this site
- Web as Corpus Wiki
- Wiki with links to web as corpus events, sites and code
- Find Search Terms
- Search by wildcard in various databases for single-word English search terms (e.g. morphological variants) for pasting into the Advanced Query field
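The "matches per million webpages" figure described under Count Matching Webpages can be sketched as follows. This is only an illustration of the arithmetic, not the site's actual code: the function names and all numbers are hypothetical. It shows why a larger estimate of the total index size lowers the per-million figure for the same raw hit count, and how one value can be printed either formatted for reading or unformatted for pasting.

```python
def hits_per_million(hits: int, total_pages: int) -> float:
    """Express a raw hit count as matches per million indexed pages."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return hits / total_pages * 1_000_000


def format_count(value: float, readable: bool = True) -> str:
    """Use thousands separators for reading, or plain digits for pasting."""
    return f"{value:,.2f}" if readable else f"{value:.2f}"


# Hypothetical numbers, for illustration only: the same raw hit count
# measured against two different estimates of the total index size.
hits = 12_500_000
small_index = 10_000_000_000   # assumed total pages for one engine
large_index = 40_000_000_000   # assumed, more optimistic total for another

print(format_count(hits_per_million(hits, small_index)))                  # 1,250.00
print(format_count(hits_per_million(hits, large_index)))                  # 312.50
print(format_count(hits_per_million(hits, small_index), readable=False))  # 1250.00
```

Because the total number of pages is itself an estimate reported by each engine, the per-million value inherits whatever error that estimate carries.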
Related papers
Ready-made frequency lists from the Web
English
- Web Corpus 2006 – 100 or more HTML
- HTML version of list of 30,524 types occurring 100 or more times
in this corpus
- Web Corpus 2006 – 100 or more TAB
- Tab-separated text version of list of 30,524 types occurring 100 or
more times
- Web Corpus 2006 – 10 or more TAB
- Tab-separated text version of list of 104,675 types occurring 10 or
more times
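For readers unfamiliar with the terms, here is a minimal sketch of how tokens, types, and a thresholded frequency list relate. It does not reproduce how the lists above were built: the tokenization rule, the function names, the sample sentence, and the column order of the tab-separated output are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract word tokens (an assumed, simplistic rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(tokens: list[str], min_count: int = 1) -> list[tuple[str, int]]:
    """Return (type, count) pairs with count >= min_count, most frequent first."""
    counts = Counter(tokens)
    kept = [(word, n) for word, n in counts.items() if n >= min_count]
    # Sort by descending count, then alphabetically to break ties.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


sample = "the cat saw the dog and the dog saw the cat"
tokens = tokenize(sample)
print(len(tokens))       # 11 tokens (running words)
print(len(set(tokens)))  # 5 types (distinct words)

# Keep only types occurring 2 or more times; print one type per line.
for word, n in frequency_list(tokens, min_count=2):
    print(f"{word}\t{n}")
```

Raising `min_count` shrinks the list quickly, which is consistent with the "10 or more" list above containing far more types than the "100 or more" list.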
Dutch & Afrikaans
Major search engines do not distinguish between Dutch and
Afrikaans: except for Google, they do not provide for searching only for pages in Afrikaans, and
searches for pages in Dutch usually return some pages in Afrikaans as well.
National domains (.nl, .be / .za) are only a rough guide to location.
International domains like .com, .net, .biz, .info etc. provide no clue to
the source. These lists were compiled to test various algorithms to
distinguish Afrikaans from Dutch pages.
- Dutch Web Corpus 2006 – 1-grams
- HTML version of list of 102,770 types occurring in a pilot corpus of
1,605,346 tokens (6.4 MB)
- Afrikaans Web Corpus 2006 – 1-grams
- HTML version of list of 62,785 types occurring in a pilot corpus of
1,263,509 tokens (3.9 MB)
Museum
Some of my ancient papers in which there has been a renewed interest
- 'Blood-hot', 'stone-good': a preliminary report on adjective-specific intensifiers in Dutch
- Leuvense Bijdragen, 69 (1980), 445-472. (PDF, 12.6 MB)
http://webascorpus.org
launched 7 February 2007, updated 30 March 2011