Unable to create dataframe from Python Event transformer JSON data in PySpark #35

Open
ajitkshirsagar opened this issue Dec 4, 2018 · 0 comments


ajitkshirsagar commented Dec 4, 2018

I am trying to process Snowplow enriched TSV events from Kafka in PySpark, but the resulting DataFrame contains only a single column, |-- _corrupt_record: string (nullable = true). The same processing works fine in Scala.

TSV Event:

1	web	2018-11-27 16:03:34.110	2018-11-27 16:03:24.058	2018-11-27 16:03:19.677	unstruct	3fbbebd4-7e20-4506-990f-48d56eda67c5		cw	js-2.9.0	ssc-0.14.0-kafka	stream-enrich-0.15.0-common-0.31.0	undefined	178.151.40.38	509396495	17b2fee3-4dc4-470f-b502-3b42e9b340d0	42	18a362bd-00a1-40d3-93af-f5e07f38d805	UA	07	Kharkiv	61135	49.980804	36.2527	Kharkivs'ka Oblast'					https://test4.abc.com/seller/orders/11432903		https://test4.abc.com/seller/orders	https	test4.abc.com	443	/seller/orders/11432903			https	test4.abc.com	443	/seller/orders			internal								{"schema":"iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-0","data":[{"schema":"iglu:com.google.analytics/cookies/jsonschema/1-0-0","data":{"_ga":"GA1.2.929926098.1540998420"}},{"schema":"iglu:com.snowplowanalytics.snowplow/web_page/jsonschema/1-0-0","data":{"id":"6f904a0e-3408-47df-8d88-672a0adcc4aa"}},{"schema":"iglu:org.w3/PerformanceTiming/jsonschema/1-0-0","data":{"navigationStart":1543333678447,"unloadEventStart":1543333680428,"unloadEventEnd":1543333680428,"redirectStart":0,"redirectEnd":0,"fetchStart":1543333678449,"domainLookupStart":1543333678468,"domainLookupEnd":1543333678845,"connectStart":1543333678845,"connectEnd":1543333679031,"secureConnectionStart":1543333678910,"requestStart":1543333679032,"responseStart":1543333680418,"responseEnd":1543333680420,"domLoading":1543333680427,"domInteractive":1543333681271,"domContentLoadedEventStart":1543333681282,"domContentLoadedEventEnd":1543333681295,"domComplete":1543333682812,"loadEventStart":1543333682812,"loadEventEnd":1543333682814}}]}						{"schema":"iglu:com.snowplowanalytics.snowplow/unstruct_event/jsonschema/1-0-0","data":{"schema":"iglu:com.snowplowanalytics.snowplow/link_click/jsonschema/1-0-1","data":{"targetUrl":"https://test4.abc.com/payments","elementId":"","elementClasses":["c-header__links__link"],"elementTarget":""}}}								Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0						en-US										1	24	1536	728				
Europe/Berlin			1536	864	UTF-8	1519	1646								Europe/Zaporozhye		2018-11-27 16:03:22.970			{"schema":"iglu:com.snowplowanalytics.snowplow/contexts/jsonschema/1-0-1","data":[{"schema":"iglu:com.snowplowanalytics.snowplow/ua_parser_context/jsonschema/1-0-0","data":{"useragentFamily":"Firefox","useragentMajor":"63","useragentMinor":"0","useragentPatch":null,"useragentVersion":"Firefox 63.0","osFamily":"Windows","osMajor":null,"osMinor":null,"osPatch":null,"osPatchMinor":null,"osVersion":"Windows","deviceFamily":"Other"}}]}	423177aa-61a0-4646-99ce-1b7a3e1a2e64	2018-11-27 16:03:20.765	com.snowplowanalytics.snowplow	link_click	jsonschema	1-0-1	8842334059245717a355e23ea32a6b2d

Event after the Python SDK transform (printed as a Python dict):

{'v_tracker': u'js-2.9.0', 'geo_timezone': u'Europe/Zaporozhye', 'page_referrer': u'https://test4.abc.com/seller/orders', 'page_urlscheme': u'https', 'etl_tstamp': u'2018-11-27T16:03:34.110Z', 'br_lang': u'en-US', 'app_id': u'1', 'contexts_com_snowplowanalytics_snowplow_ua_parser_context_1': [{u'osMajor': None, u'useragentMinor': u'0', u'osMinor': None, u'osPatch': None, u'osFamily': u'Windows', u'useragentPatch': None, u'osPatchMinor': None, u'useragentMajor': u'63', u'deviceFamily': u'Other', u'osVersion': u'Windows', u'useragentVersion': u'Firefox 63.0', u'useragentFamily': u'Firefox'}], 'geo_region_name': u"Kharkivs'ka Oblast'", 'derived_tstamp': u'2018-11-27T16:03:20.765Z', 'refr_urlhost': u'test4.abc.com', 'collector_tstamp': u'2018-11-27T16:03:24.058Z', 'refr_urlpath': u'/seller/orders', 'event': u'unstruct', 'user_fingerprint': u'509396495', 'user_id': u'undefined', 'domain_userid': u'17b2fee3-4dc4-470f-b502-3b42e9b340d0', 'br_colordepth': u'24', 'geo_country': u'UA', 'event_id': u'3fbbebd4-7e20-4506-990f-48d56eda67c5', 'event_format': u'jsonschema', 'refr_urlport': 443, 'doc_height': 1646, 'v_etl': u'stream-enrich-0.15.0-common-0.31.0', 'platform': u'web', 'refr_urlscheme': u'https', 'geo_location': u'49.980804,36.2527', 'doc_charset': u'UTF-8', 'page_urlhost': u'test4.abc.com', 'geo_region': u'07', 'domain_sessionidx': 42, 'br_viewheight': 728, 'geo_city': u'Kharkiv', 'geo_zipcode': u'61135', 'event_name': u'link_click', 'dvce_screenheight': 864, 'page_urlpath': u'/seller/orders/11432903', 'doc_width': 1519, 'event_vendor': u'com.snowplowanalytics.snowplow', 'br_cookies': True, 'name_tracker': u'cw', 'network_userid': u'18a362bd-00a1-40d3-93af-f5e07f38d805', 'contexts_org_w3_performance_timing_1': [{u'secureConnectionStart': 1543333678910, u'redirectStart': 0, u'domContentLoadedEventStart': 1543333681282, u'domContentLoadedEventEnd': 1543333681295, u'redirectEnd': 0, u'fetchStart': 1543333678449, u'navigationStart': 1543333678447, u'domainLookupEnd': 
1543333678845, u'connectEnd': 1543333679031, u'unloadEventEnd': 1543333680428, u'requestStart': 1543333679032, u'responseEnd': 1543333680420, u'unloadEventStart': 1543333680428, u'domLoading': 1543333680427, u'connectStart': 1543333678845, u'loadEventStart': 1543333682812, u'responseStart': 1543333680418, u'loadEventEnd': 1543333682814, u'domComplete': 1543333682812, u'domInteractive': 1543333681271, u'domainLookupStart': 1543333678468}], 'geo_latitude': 49.980804, 'user_ipaddress': u'178.151.40.38', 'unstruct_event_com_snowplowanalytics_snowplow_link_click_1': {u'elementClasses': [u'c-header__links__link'], u'elementId': u'', u'elementTarget': u'', u'targetUrl': u'https://test4.abc.com/payments'}, 'domain_sessionid': u'423177aa-61a0-4646-99ce-1b7a3e1a2e64', 'dvce_screenwidth': 1536, 'refr_medium': u'internal', 'contexts_com_google_analytics_cookies_1': [{u'_ga': u'GA1.2.929926098.1540998420'}], 'geo_longitude': 36.2527, 'dvce_sent_tstamp': u'2018-11-27T16:03:22.970Z', 'br_viewwidth': 1536, 'v_collector': u'ssc-0.14.0-kafka', 'event_version': u'1-0-1', 'event_fingerprint': u'8842334059245717a355e23ea32a6b2d', 'contexts_com_snowplowanalytics_snowplow_web_page_1': [{u'id': u'6f904a0e-3408-47df-8d88-672a0adcc4aa'}], 'dvce_created_tstamp': u'2018-11-27T16:03:19.677Z', 'useragent': u'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0', 'os_timezone': u'Europe/Berlin', 'page_urlport': 443, 'page_url': u'https://test4.abc.com/seller/orders/11432903'}

PySpark code:

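from snowplow_analytics_sdk import event_transformer

# Read the enriched TSV events, one event per line.
input = sc.textFile("events.tsv")
# Transform each TSV line with the Python Analytics SDK.
jsons = input.map(lambda line: event_transformer.transform(line))
# Try to build a DataFrame from the transformed events.
df = spark.read.json(jsons)
df.printSchema()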

Output:

>>> df.printSchema()
root
 |-- _corrupt_record: string (nullable = true)

Why am I getting corrupt records with the Python SDK when the same data works fine with the Scala SDK?
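
One plausible explanation, offered as a sketch rather than a confirmed diagnosis: the transformed event shown above is a Python dict (note the u'...' strings, None and True), not a JSON string. spark.read.json on an RDD expects each element to be a JSON document in string form. As far as can be told, PySpark stringifies anything else, and the str() of a dict is not valid JSON, which would produce _corrupt_record. This might also explain the difference from the Scala SDK, whose transformer appears to emit a JSON string. The stdlib-only snippet below illustrates the difference on a hand-copied subset of the event; the event dict and the is_valid_json helper are illustrative names, not part of the SDK.

```python
import json

# Stand-in for the dict returned by event_transformer.transform(line):
# a small, hand-copied subset of the event shown above (assumption:
# the real return value is a plain dict like this one).
event = {
    "app_id": "1",
    "br_cookies": True,
    "page_urlport": 443,
    "geo_region_name": "Kharkivs'ka Oblast'",
    "contexts_com_snowplowanalytics_snowplow_ua_parser_context_1": [
        {"useragentFamily": "Firefox", "osMajor": None}
    ],
}


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


as_repr = str(event)         # Python literal syntax: single quotes, True, None
as_json = json.dumps(event)  # JSON syntax: double quotes, true, null

print(is_valid_json(as_repr))  # False
print(is_valid_json(as_json))  # True

# Possible change to the PySpark code above (needs Spark, not run here):
#   jsons = input.map(lambda line: json.dumps(event_transformer.transform(line)))
#   df = spark.read.json(jsons)
```

If this is indeed the cause, serializing each transformed event with json.dumps before calling spark.read.json may be enough. This has not been verified against a running Spark cluster.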
