A single content file record.
The configuration for the contentfiles importer.
A regular expression to match repo-relative file paths. E.g. the pattern "src/.*\.c" would match any file under a repository's 'src/' directory whose path ends in the suffix '.c'. The pattern must match the entire path. E.g. the pattern '\.c$' would match only a single file called '.c'. The start-of-match and end-of-match anchors '^' and '$' are permitted but redundant.
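The whole-path matching rule above can be illustrated with Python's standard `re.fullmatch`. This is a minimal sketch only: the helper name `path_matches` is hypothetical, and the importer's actual matching code is not shown in this document.

```python
import re


def path_matches(pattern: str, relpath: str) -> bool:
    """Return True only if the pattern matches the entire repo-relative path.

    Hypothetical helper for illustration; not part of the importer's API.
    """
    return re.fullmatch(pattern, relpath) is not None


# A pattern anchored to 'src/' matches .c files anywhere below that directory.
print(path_matches(r"src/.*\.c", "src/lib/util.c"))
# A path ending in '.cc' is rejected because the whole path must match.
print(path_matches(r"src/.*\.c", "src/lib/util.cc"))
# '\.c$' on its own only matches a file literally called '.c'.
print(path_matches(r"\.c$", "main.c"))
print(path_matches(r"\.c$", ".c"))
# Explicit anchors are accepted but add nothing.
print(path_matches(r"^src/.*\.c$", "src/a.c"))
```

To match every '.c' file regardless of directory under this rule, a pattern such as ".*\.c" would be needed.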
A list of preprocessor passes to run on each imported source file, in the order in which they should be executed. A preprocessor is a Python function, decorated with the @datasets.github.scrape_repos.preprocessors.public.dataset_preprocessor decorator. The name of a preprocessor is the fully qualified Python module, followed by ':' and the name of the Python function, e.g. datasets.github.scrape_repos.preprocessors.extractors:JavaMethods. See //datasets/github/scrape_repos/preprocessors/... for definitions.
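As a rough sketch of how a "module:function" name could be resolved and a list of passes applied in order, the example below uses only the standard library. The helpers `resolve_preprocessor` and `run_passes` are hypothetical, the sketch does not check for the dataset_preprocessor decorator, and it assumes each pass maps one string to one string, which may not hold for the real preprocessors. Standard-library functions stand in for real preprocessors so the example runs on its own.

```python
import importlib
from typing import Callable, List


def resolve_preprocessor(name: str) -> Callable[[str], str]:
    """Resolve '<module>:<function>' to a callable (hypothetical helper)."""
    module_name, sep, function_name = name.partition(":")
    if not sep or not module_name or not function_name:
        raise ValueError(f"Expected '<module>:<function>', got {name!r}")
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


def run_passes(names: List[str], text: str) -> str:
    """Apply each named pass to the text, in list order.

    Assumption: every pass takes a string and returns a string.
    """
    for name in names:
        text = resolve_preprocessor(name)(text)
    return text


# Stand-in passes from the standard library, not real dataset preprocessors.
passes = ["textwrap:dedent", "html:unescape"]
print(run_passes(passes, "  a &lt; b\n  c"))
```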
The "metafile" schema. Each repository which is scraped produces one of these files, recording various attributes about the repository.
The number of milliseconds since the Unix epoch that the repository was scraped or cloned.
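For illustration, a timestamp in this unit can be produced and decoded with Python's `datetime` module. The function names below are hypothetical and only show the unit conversion; they are not taken from the scraper itself.

```python
from datetime import datetime, timezone


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to milliseconds since the Unix epoch."""
    return int(moment.timestamp() * 1000)


def from_epoch_ms(ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


scraped_at = datetime(2020, 1, 1, tzinfo=timezone.utc)
ms = to_epoch_ms(scraped_at)
print(ms)
print(from_epoch_ms(ms).isoformat())
```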
The GitHub username of the repository owner.
The name of the repository as it appears on GitHub.
The git URL to clone the repo.
The number of stargazers, forks, and watchers of the repository.
A GitHub query to search for repositories.
The query string.
The maximum number of results to process for this query. Fewer results will be processed if the query returns fewer than this number of results. Note that the GitHub API may limit the maximum number of results which are returned (currently this limit is 1,000 results per query).
The schema for "clone lists", which are used to determine the GitHub queries to run and repos to clone.
A single programming language for which to clone repositories.
A list of queries to search for repositories on GitHub. Queries are searched in the order they appear, until num_repos_to_clone repositories have been cloned. For parameters, see the GitHub API docs: https://developer.github.com/v3/search/#search-repositories
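The ordering rule above, combined with the per-query result limit, might look roughly like the following. This is a sketch under stated assumptions: `select_repos` and the `search` callable are hypothetical stand-ins for the real GitHub search, queries are modelled as (query string, maximum results) pairs, and skipping duplicate URLs is an assumption rather than documented behaviour.

```python
import itertools
from typing import Callable, Iterable, List, Tuple


def select_repos(
    queries: List[Tuple[str, int]],
    num_repos_to_clone: int,
    search: Callable[[str], Iterable[str]],
) -> List[str]:
    """Walk the queries in order until enough repositories are selected."""
    selected: List[str] = []
    for query_string, max_results in queries:
        # Process at most max_results results from this query.
        for url in itertools.islice(search(query_string), max_results):
            if len(selected) >= num_repos_to_clone:
                return selected
            if url not in selected:  # assumption: repeated repos count once
                selected.append(url)
    return selected


# Fake search results standing in for the GitHub API.
fake_results = {
    "language:java": ["repo-a", "repo-b", "repo-c"],
    "language:java sort:stars": ["repo-c", "repo-d"],
}
queries = [("language:java", 2), ("language:java sort:stars", 10)]
print(select_repos(queries, 3, lambda q: fake_results[q]))
```

Later queries are only consulted when earlier ones have not supplied enough repositories, so the most specific or most trusted queries are best listed first.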
The base directory to clone GitHub repositories to.
The configuration for the program which imports source code from cloned repositories into contentfile databases.
An optional list of HTTPS repo URLs to ignore, e.g. "https://github.com/ChrisCummins/phd.git". Before a repo is scraped, its URL is checked against this list, and the repo is ignored if it matches. Blacklisting is case sensitive: only lower case letters should be used.
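One way to read the lower-case requirement is that the candidate URL is lower-cased before an exact comparison against the list. The sketch below encodes that reading; it is an assumption rather than the documented implementation, and `is_blacklisted` is a hypothetical name.

```python
from typing import Iterable


def is_blacklisted(clone_url: str, blacklist: Iterable[str]) -> bool:
    """Return True if the repo URL appears in the blacklist.

    Assumption: the candidate URL is lower-cased first, and entries are
    compared exactly, so an entry containing upper case never matches.
    """
    return clone_url.lower() in set(blacklist)


url = "https://github.com/ChrisCummins/phd.git"
print(is_blacklisted(url, ["https://github.com/chriscummins/phd.git"]))
print(is_blacklisted(url, ["https://github.com/ChrisCummins/phd.git"]))
```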
Used by JavaMethodsExtractor to extract a list of methods from an input source.