I have a CSV file like this:
Meme1, Meme2, Meme3, Meme4, Meme5, Meme6
Meme1, Meme2, Meme3, Meme99, Meme5, Meme6
Meme5, Meme2, Meme2, Meme4, Meme10, Meme6
Meme99, Meme3, Meme4, Meme4, Meme5, Meme6
I want output like this:
00000001, 00000010, 00000011, 00000100, 00000101, 00000110
00000001, 00000010, 01100011, 00000100, 00000101, 00000110
00000100, 00000010, 00000010, 00000100, 00001010, 00000110
That is, every integer should be converted to 8-bit binary and the word Meme should be deleted.
I am trying but cannot get it to work :(
import pandas as pd
import csv
import numpy as np
dataset = pd.read_csv('datsetcoma.txt')
reader = csv.DictReader(dataset)
print (reader)
# print back the headers
for row in reader:
    if row.is_integer:
        b=np.binary_repr(10, width=8)
        print (b)
CodePudding user response:
You can also try this:
import pandas as pd
import numpy as np
import io
# example taken from @ifly6
df = pd.read_csv(io.StringIO('''Meme1, Meme2, Meme3, Meme4, Meme5, Meme6
Meme1, Meme2, Meme3, Meme99, Meme5, Meme6
Meme5, Meme2, Meme2, Meme4, Meme10, Meme6
Meme99, Meme3, Meme4, Meme4, Meme5, Meme6'''), header=None)
df.apply(lambda x: x.apply(lambda y: bin(int(y.replace('Meme', '')))[2:].zfill(8) ) )
#output
0 1 2 3 4 5
0 00000001 00000010 00000011 00000100 00000101 00000110
1 00000001 00000010 00000011 01100011 00000101 00000110
2 00000101 00000010 00000010 00000100 00001010 00000110
3 01100011 00000011 00000100 00000100 00000101 00000110
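If you would rather avoid the nested lambdas, a minimal sketch along the same lines is below. It assumes every cell is the word Meme followed by a non-negative integer below 256; the helper name meme_to_bits and the two-row sample are made up for illustration.

```python
import io
import pandas as pd

raw = '''Meme1, Meme2, Meme3, Meme4, Meme5, Meme6
Meme1, Meme2, Meme3, Meme99, Meme5, Meme6'''

def meme_to_bits(cell, width=8):
    # Hypothetical helper: strip whitespace and the "Meme" prefix,
    # then format the remaining integer as zero-padded binary.
    number = int(cell.strip().replace('Meme', ''))
    return format(number, f'0{width}b')

# skipinitialspace drops the blank after each comma while parsing
df = pd.read_csv(io.StringIO(raw), header=None, skipinitialspace=True)
bits = df.apply(lambda col: col.map(meme_to_bits))

# Turn the result back into comma-separated text without index or header
csv_text = bits.to_csv(index=False, header=False)
print(csv_text)
```

Passing a file name to to_csv instead of leaving the path empty would write the same text to disk.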
CodePudding user response:
Loading the DF using io.StringIO with no headers, I extract the integers using a regular expression without expansion, then cast to integer types. Because np.binary_repr is not vectorised, I have to "vectorise" it.
Because np methods do not retain indexing, I then reproduce the indices (needed to retain the row and column positions, which are preserved in the multi-index) in the pd.Series constructor and unstack back to the original data frame shape.
import io
import numpy as np
import pandas as pd
df = pd.read_csv(io.StringIO('''Meme1, Meme2, Meme3, Meme4, Meme5, Meme6
Meme1, Meme2, Meme3, Meme99, Meme5, Meme6
Meme5, Meme2, Meme2, Meme4, Meme10, Meme6
Meme99, Meme3, Meme4, Meme4, Meme5, Meme6'''), header=None)
s = df.stack()
s = s.str.extract(r'(\d+)', expand=False).astype(int)
pd.Series(np.vectorize(np.binary_repr)(s, width=8), index=s.index).unstack()
The final output:
0 1 2 3 4 5
0 00000001 00000010 00000011 00000100 00000101 00000110
1 00000001 00000010 00000011 01100011 00000101 00000110
2 00000101 00000010 00000010 00000100 00001010 00000110
3 01100011 00000011 00000100 00000100 00000101 00000110
NB: your binary conversions in the original post are not all accurate. E.g. Meme5 is erroneously converted to 00000100 when it should be 00000101. The OP version also omits (probably for convenience) the final row.
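Those two values are easy to double-check with Python's built-in format, independent of pandas or NumPy (the variable names are just for illustration):

```python
# 5 = 4 + 1, so the low three bits are 101
five = format(5, '08b')

# 99 = 64 + 32 + 2 + 1
ninety_nine = format(99, '08b')

print(five, ninety_nine)
```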
NB also that this will not work if the pattern has multiple capture groups. In a comment I posited the hypothetical example foo123bar456. Capturing both numbers would need two groups, and str.extract then returns a DataFrame with one column per group rather than a Series, which would disturb the indexing.
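A small sketch of how the extraction behaves on that hypothetical string (the sample Series and the two-group pattern are made up for illustration). With the single capture group used above, str.extract keeps only the first run of digits; a pattern with two capture groups returns a DataFrame even with expand=False, and that shape no longer fits the stack/unstack round trip.

```python
import pandas as pd

s = pd.Series(['foo123bar456'])

# One capture group: only the first match is returned, as a Series
first = s.str.extract(r'(\d+)', expand=False)

# Two capture groups: the result is a DataFrame with one column per group
both = s.str.extract(r'(\d+)\D+(\d+)', expand=False)

print(first.tolist())
print(both.shape)
```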