Weird CSV Import defaulting to "h" delimiter #377

Open
Clint-Holt opened this issue Jan 3, 2025 · 1 comment
Labels
enhancement New feature or request

Comments

@Clint-Holt

Type: Bug

Behaviour

Data Wrangler very frequently imports CSV files using "h" as the delimiter, which I would expect to be far less common than ",". I can easily switch the delimiter and reimport my CSV files, but I wanted to mention it.

Expected vs. Actual

Expected: CSV files are split on commas, not on "h" characters.

Steps to reproduce:

  1. Import the following text; it will probably be imported with "h" as the delimiter:
    Id,domain_no,hmm_species,chain_type,e-value,score,seqstart_index,seqend_index,identity_species,v_gene,v_identity,j_gene,j_identity,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127
    5-1_LC,0,human,K,2e-55,177.3,0,106,human,IGKV1-501,0.90,IGKJ101,0.83,D,I,Q,M,T,Q,S,P,S,T,L,S,A,S,V,G,D,R,V,T,I,T,C,R,A,S,Q,S,I,-,-,-,-,-,-,X,D,X,L,A,W,Y,Q,H,S,P,G,X,A,P,K,L,L,I,Y,R,A,-,-,-,-,-,-,-,S,R,L,E,S,G,V,P,-,S,R,F,S,G,S,G,-,-,S,G,T,E,F,T,L,T,I,S,S,L,Q,P,D,D,F,A,S,Y,Y,X,X,X,X,X,X,-,-,-,-,X,X,Q,T,F,G,Q,G,T,R,V,E,I,K
    1279079-->11695_LC,0,human,K,7.4e-57,181.9,0,106,human,IGKV1-503,0.91,IGKJ101,0.92,D,I,Q,M,T,Q,S,P,S,T,L,S,A,S,V,G,D,R,V,A,I,T,C,R,A,S,Q,N,V,-,-,-,-,-,-,X,W,X,A,W,F,Q,Q,K,P,G,K,A,P,N,X,L,I,Y,K,A,-,-,-,-,-,-,-,S,S,L,E,S,G,V,P,-,S,R,F,S,G,S,G,-,-,S,G,T,E,F,T,L,T,I,S,S,L,Q,P,D,D,F,A,T,Y,Y,C,X,X,X,X,X,-,-,-,-,X,S,R,T,F,G,Q,G,T,K,V,E,I,K

Diagnostic data

  • Jupyter extension version: 2024.11.0
  • Python extension version: 2024.22.1
  • .NET Install Tool for Extension Authors extension version: Not installed
  • Python package dependencies:
{
  "installed": {
    "pandas": "2.0.3",
    "pyarrow": "15.0.0"
  },
  "required": {
    "pandas": "1.2.0"
  },
  "unsatisfied": []
}
  • Entrypoint: Command
  • Active mode: dataWrangler

Extension version: 1.14.0
VS Code version: Code 1.96.2 (fabdb6a30b49f79a7aba0f2ad9df9b399473380f, 2024-12-19T10:22:47.216Z)
OS version: Windows_NT x64 10.0.26100
Modes:
Remote OS version: Linux x64 5.15.167.4-microsoft-standard-WSL2
Remote OS version: Linux x64 6.8.0-45-generic

System Info
Item | Value
CPUs | 11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz (16 x 2496)
GPU Status | 2d_canvas: enabled; canvas_oop_rasterization: enabled_on; direct_rendering_display_compositor: disabled_off_ok; gpu_compositing: enabled; multiple_raster_threads: enabled_on; opengl: enabled_on; rasterization: enabled; raw_draw: disabled_off_ok; skia_graphite: disabled_off; video_decode: enabled; video_encode: enabled; vulkan: disabled_off; webgl: enabled; webgl2: enabled; webgpu: enabled; webnn: disabled_off
Load (avg) | undefined
Memory (System) | 39.68GB (20.17GB free)
Process Argv | --folder-uri=vscode-remote://wsl+Ubuntu/home/holtcm/defining_pub_clone --remote=wsl+Ubuntu --crash-reporter-id 6cf2ec46-767c-408a-bb23-3db2cbbd1c7a
Screen Reader | no
VM | 0%

Item | Value
Remote | WSL: Ubuntu
OS | Linux x64 5.15.167.4-microsoft-standard-WSL2
CPUs | 11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz (16 x 0)
Memory (System) | 19.38GB (17.58GB free)
VM | 0%

Item | Value
Remote | SSH: 10.151.18.12
OS | Linux x64 6.8.0-45-generic
CPUs | Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz (96 x 1000)
Memory (System) | 1006.53GB (958.20GB free)
VM | 0%

A/B Experiments
vsliv368cf:30146710
vspor879:30202332
vspor708:30202333
vspor363:30204092
vscod805:30301674
binariesv615:30325510
vsaa593cf:30376535
py29gd2263:31024239
vscaac:30438847
c4g48928:30535728
azure-dev_surveyone:30548225
a9j8j154:30646983
962ge761:30959799
pythonnoceb:30805159
pythonmypyd1:30879173
h48ei257:31000450
pythontbext0:30879054
cppperfnew:31000557
dsvsc020:30976470
pythonait:31006305
dsvsc021:30996838
dvdeprecation:31068756
dwnewjupytercf:31046870
2f103344:31071589
nativerepl1:31139838
pythonrstrctxt:31112756
nativeloc2:31192216
cf971741:31144450
iacca1:31171482
notype1:31157159
5fd0e150:31155592
dwcopilot:31170013
stablechunks:31184530
6074i472:31201624

@pwang347
Member

pwang347 commented Jan 3, 2025

Hi @Clint-Holt, thanks for opening this issue!

Seems like you have this experimental setting enabled, and may want to consider disabling it if it's causing problems:
[Image: the experimental setting referred to above]

That being said, I noticed that setting sep=None in pandas read_csv correctly uses "," instead of "h" as the separator for the example you provided. We also use the same CSV sniffer module to infer the separator, so we'll need to look into why there is a discrepancy.
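
For anyone who wants to compare the two behaviours locally, here is a minimal sketch using only Python's standard csv.Sniffer, which is the sniffer pandas documents using when read_csv is called with sep=None and engine="python". The helper name sniff_delimiter and its first_line_only and candidates parameters are hypothetical names made up for this illustration; this is not Data Wrangler's implementation. One possible source of a discrepancy, not confirmed in this thread, is how much of the file is handed to the sniffer, so the sketch lets you try both the full sample and the first line only.

```python
import csv
from typing import Optional


def sniff_delimiter(
    text: str,
    first_line_only: bool = False,
    candidates: Optional[str] = None,
) -> str:
    """Return the delimiter that csv.Sniffer infers from `text`.

    first_line_only (hypothetical switch for this sketch) restricts the
    sample to the first non-empty line before sniffing.
    candidates, if given, is passed to Sniffer.sniff() as `delimiters`,
    which limits the characters the sniffer may choose from.
    """
    if first_line_only:
        lines = [line for line in text.splitlines() if line.strip()]
        text = lines[0] if lines else ""
    # sniff() raises csv.Error when it cannot determine a delimiter.
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    return dialect.delimiter


if __name__ == "__main__":
    # Small stand-in sample; paste the rows from the report to test them.
    sample = "a,b,c\n1,2,3\n4,5,6\n"
    print(repr(sniff_delimiter(sample)))
    print(repr(sniff_delimiter(sample, first_line_only=True)))
    print(repr(sniff_delimiter(sample, candidates=",;\t")))
```

Restricting the candidates (for example to ",;\t") is one way to keep a letter such as "h" from ever being selected, at the cost of not detecting unusual separators.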

Thank you for letting us know!

pwang347 added the enhancement (New feature or request) label on Jan 8, 2025