Using uri_spider to parse file metadata
April 11, 2019
The uri_spider task, when given a Uri entity such as http://www.acme.com, will spider a site to a specified level of depth (max_depth), a specified max number of pages (limit), and if configured, a specified url pattern (spider_whitelist). When configured – and by default – it will extract DnsRecord types, PhoneNumbers and EmailAddress type entities in the content of the page. All spidered Uris can be can created as entities using the extract_uris option.
Further, the spider will identify any files of the types listed below, and parse their content and metadata for the same types. Because this file parsing uses the excellent Apache Tika under the hood, the number and type of supported file formats is huge – over 300 file formats are supported including common formats like doc, docx and pdf – as well as more exotic types like application/ogg and many video formats. To enable this, simply enable the parse_file_metadata option.
Below, see a screenshot of the task’s configuration:
Note that you can also take advantage of Intrigue Core’s file parsing capabilities on a Uri by Uri basis by pointing the uri_extract_metadata task at a specific Uri with a file you’d like parsed, such at https://acme.com/file.pdf